Preface


The solution of linear systems of equations Ax = b is a cornerstone of computational
science and engineering. Being able to solve linear systems in a reliable and
efficient way is of great importance and interest not only to scientists and engineers 
but also to a huge and varied community of people who are unaware that at the heart 
of the software they are using lies a linear equation solver and that this is key to 
its feasibility and performance. In many applications, the linear systems that must 
be solved are large and square and they are sparse (that is, many of the entries 
in the system matrix A are zero). Direct methods for solving such systems are 
characterized by computing a factorization (or decomposition) of A into a product 
of much simpler matrices in such a way that solving systems of equations with 
these matrices is easy and inexpensive. For example, A may be factorized into a 
product of triangular matrices; in principle, solving a linear system in which the 
system matrix is triangular is straightforward. Direct methods obtain the solution 
to the linear system in a finite and fixed number of steps that is independent of A 
and b. Because of rounding errors, the computed solution is generally not equal to 
the exact one but, if a direct method is well implemented, the resulting software is 
extremely robust and can be used as a “black box solver”, with the user not needing 
any detailed knowledge or understanding of what is going on within the box. 

By contrast, an iterative method (sometimes also called an indirect method) 
generally involves an unknown number of steps and its performance is highly 
problem dependent. In many cases, for the method to converge to the sought-after 
solution of the linear system, it is necessary to use a preconditioner. This has to 
be tailored to the system being solved. The aim is to transform the linear system 
into one with more favourable numerical properties so that, when applied to the 
transformed system, the iterative solver converges to a solution of the requested 
accuracy in an acceptable number of steps. The major advantage of iterative solvers 
over direct ones is that they require very little memory and, once the preconditioner 
has been constructed, most of the computational work is in the application of the 
preconditioner and matrix-vector products with A. For extremely large problems 
(for example, systems coming from discretizations of real-world three- or
four-dimensional problems), memory requirements prohibit the use of direct methods,
and without suitable iterative methods the systems would be intractable. 

This book presents classical techniques for matrix factorizations based on 
variants of Gaussian elimination that are used in sparse direct methods and discusses 
the construction of approximate direct and inverse factorizations that are key to 
developing algebraic preconditioners for use with iterative solvers. While a number 
of books on iterative solvers discuss the construction of simple incomplete matrix 
factorizations for use as preconditioners, very few attempt to unite the fields of 
complete and incomplete factorizations or cover contemporary approaches. To 
achieve this broad view, we use a single framework that emphasizes the underlying 
sparsity structures and highlights the importance of understanding sparse direct 
techniques when building algebraic preconditioners. 

The book is algorithmically oriented, presenting computational schemes that 
are designed to provide both an understanding of sophisticated sparse factorization 
techniques and how they can be implemented in practice. Throughout, we include 
outline algorithmic descriptions and use pseudocode that is independent of any 
programming language. However, limitations on space mean that it is beyond the 
scope of the book to discuss the complex implementation details that are needed 
in the development of high-quality sophisticated (parallel) production software for 
efficiently solving sparse linear systems using modern computer architectures. 

The book is aimed at students of applied mathematics and scientific computing as 
well as at computational scientists and software developers interested in understanding
the theory and algorithms needed to tackle the challenge of solving large-scale
linear systems. The presented treatment is intended to be largely self-contained, 
and we assume only that the reader has a basic knowledge of linear algebra and 
numerical mathematics. 

The organization of the book is as follows. Chapter 1 provides a general 
introduction to sparse matrices and the challenges of solving large sparse linear 
systems of equations. Concepts from graph theory that are used in the development 
of sparse matrix algorithms are recalled in Chapter 2. The material in Chapters 1 
and 2 is rather elementary, but it serves to remind the reader of important ideas and 
to introduce the notation and terminology that is used throughout the rest of the 
book. An introduction to sparse matrix factorizations, including the use of block 
forms, is given in Chapter 3. Then, in Chapters 4 and 5, the symbolic and numerical 
factorization phases of sparse Cholesky methods for solving the important class of 
symmetric positive definite linear systems are discussed. Sparse LU factorizations 
for general nonsymmetric sparse systems are described in Chapter 6. Chapter 7 is 
devoted to stability and pivoting strategies and includes a discussion of factorizing 
sparse symmetric indefinite systems. Sparse matrix ordering algorithms that are 
essential for the efficiency of sparse solvers are presented in Chapter 8. 

The final three chapters of the book switch attention from direct methods to 
the study of algebraic preconditioners for use with iterative solvers. The emphasis 
is on employing and adapting ideas and concepts used by direct solvers in the 
development of effective general classes of preconditioners that can be used 
for tackling a wide range of problems, without relying on detailed knowledge
of the properties of the underlying application. Chapter 9 introduces algebraic
preconditioners and approximate factorizations. Chapters 10 and 11 then focus on
two key classes of algebraic preconditioners: incomplete factorizations and sparse
approximate inverse preconditioners.

We do not attempt to cite all the vast array of publications related to sparse 
direct methods and algebraic preconditioners. Furthermore, we do not include 
proofs for all the theoretical results that we present. Rather, for each theorem, 
we provide one or more citations to where the reader can find a proof and/or 
get a better understanding of the result. In general, we include citations to the 
original paper/book/report (or a textbook for standard results) and, in some cases, an 
additional citation that is either more accessible or presents an alternative proof. In 
addition, at the end of each chapter, we have a short section of notes with references 
to key publications that give a historical perspective and/or provide further reading. 
It is interesting to note that a Google Scholar search in July 2022 for the term 
“sparse matrix” lists more than 2.7 million results, while a search for “sparse matrix 
decompositions” gives in excess of a million results. Although the majority may 
not be relevant to our areas of interest, it does indicate the wealth of the available 
literature as well as the importance of sparse matrix algorithms and their widespread 
use. 

This monograph and its study of sparse linear systems represent a natural
extension of our successful long-term research collaboration, combined with the 
research and the software development projects that we have each worked on 
with other researchers. Past and present colleagues at the Rutherford Appleton 
Laboratory that Jennifer would particularly like to acknowledge and thank for many 
years of collaborations and enjoyable coffee time chats are Iain Duff, Nick Gould, 
Jonathan Hogg, Yifan Hu, Tyrone Rees, and John Reid. Miroslav would like to 
express his thanks to his first major collaborator Michele Benzi, from whom he 
learnt a lot, to Ivan Němec, who invited him to work on codes that are now in
the RFEM Structural Analysis and Engineering Software, and to his colleagues
and friends in Prague, especially Zdeněk Strakoš, Miro Rozložník, Josef Málek,
Petr Tichý, and Iveta Hnětynková, who created a kind and productive working
environment. 

We are very grateful to Hussam Al Daas, Jonathan Hogg, and Gerard Meurant 
for reading and commenting on all or part of a draft of the book. They spotted errors 
and made suggestions that led to important improvements; we really appreciate the 
time they spent doing this for us. We would also like to thank our institutions for 
opportunities to spend time in Prague, the Rutherford Appleton Laboratory and 
Reading working on our joint research projects. Jennifer would like to acknowledge 
funding over the last 30 years from the Science and Technology Facilities Council 
and the Engineering and Physical Sciences Research Council. And we are extremely 
grateful to the University of Reading for providing the funding that allows this book 
to be published as open access. 

And, finally, we each owe a huge debt of gratitude to our families. Jennifer wishes 
to dedicate the book to her close family, both those who are no longer with us and 
those who continue to be an important part of her life, and most especially Stewart,
Emma, Simon, Mark, and Rebecca for their constant encouragement. Miroslav
would like to dedicate the book to the memory of his ever-supportive parents and 
to thank Anna, Markéta and Martin, who have always tolerated his passion for 
research. 


Harwell and Reading, UK Jennifer Scott 
Prague, Czech Republic Miroslav Tůma 
August 2022


Notation: Quick Reference Summary


Notational Conventions Used for Matrices and Vectors 


Capital italic letters, e.g. $A$, $L$, $P$
    Matrices

Uppercase calligraphic letters, e.g. $\mathcal{I}$, $\mathcal{S}$
    Sets containing indices

Lower case non-integer italic letters, e.g. $p$, $x$, $y$
    Vectors (may also denote a scalar or function but this will be clear from the context)

Lower case integer italic letters, e.g. $i$, $j$
    Integer scalars

Lower case Greek italic letters, e.g. $\alpha$, $\beta$, $\mu_i$
    Real scalars

Subscripted lower case non-integer italic letters, e.g. $x_i$, $x_{i:j}$
    Vector entries, e.g. entry i and entries i to j of x; $x_i$ may also denote column i of matrix X

Double subscripted lower case italic letters, e.g. $a_{ij}$
    Entry in row i, column j of matrix A

Double subscripted bracketed upper case italic letters, e.g. $(A)_{ij}$
    Entry in row i, column j of matrix A (alternative notation for $a_{ij}$)


Different forms of double subscripted upper case italic letters:


$A_{ib,jb}$    Sub-block of matrix A in position (ib, jb)
$A_{i,:}$ or $A_{i,1:n}$    Row i of matrix A (with n columns)
$A_{:,j}$ or $A_{1:n,j}$    Column j of matrix A (with n rows)
$A_{i:j,k}$    Submatrix comprising rows i to j, column k
$A_{i:j,k:l}$    Submatrix comprising rows i to j, columns k to l
$A_j = A_{1:j,1:j}$    Principal leading submatrix of A of order j
$A_{\mathcal{I},\mathcal{J}}$    Submatrix of A with row and column indices in sets $\mathcal{I}$ and $\mathcal{J}$, respectively
$A_{i,\mathcal{J}}$    Entries in row i of A with column indices in set $\mathcal{J}$
$A_{\mathcal{I},j}$    Entries in column j of A with row indices in set $\mathcal{I}$


Lower case italic letters with superscript in brackets, e.g. $x^{(k)}$
    Value of x at iteration k

Upper case italic letters with superscript in brackets, e.g. $A^{(k)}$
    Matrix A at iteration k


Notational Conventions Used When Discussing Graphs 


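$\mathcal{G} = (\mathcal{V}, \mathcal{E})$    Graph with vertices $\mathcal{V}$ and edges $\mathcal{E}$
$\mathcal{G}(A) = (\mathcal{V}(A), \mathcal{E}(A))$    Adjacency graph of matrix A
$adj_{\mathcal{G}}\{v\}$    Adjacency set of vertex v
$\mathcal{G}^{-}(A)$    Skeleton graph of matrix A
$\mathcal{G}^{k}$    k-th elimination graph
$\mathcal{G}^{[k]}$    k-th quotient graph
$\mathcal{T}(A)$ or $\mathcal{T}$    Elimination tree of matrix A
$\mathcal{T}_r(i)$    i-th row subtree of $\mathcal{T}$
$\mathcal{T}(j)$    Subtree of $\mathcal{T}$ rooted at vertex j


Lower case italic letters, e.g. $i$, $j$, $u$, $v$    Graph vertices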


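The following are for an undirected graph $\mathcal{G}$:


$i_0 \longleftrightarrow i_1 \longleftrightarrow \cdots \longleftrightarrow i_{p-1} \longleftrightarrow i_p$    Sequence of undirected edges in $\mathcal{G}$
$(i \overset{\mathcal{G}}{\longleftrightarrow} j)$ or $(i \longleftrightarrow j)$ or $(i, j)$    Undirected edge between i and j in $\mathcal{G}$
$i \overset{\mathcal{G}}{\Longleftrightarrow} j$ or $i \Longleftrightarrow j$    Path from i to j in $\mathcal{G}$
$i \underset{\min}{\overset{\mathcal{G}}{\Longleftrightarrow}} j$ or $i \underset{\min}{\Longleftrightarrow} j$    All intermediate vertices on the path are less than min{i, j} (fill-path)
$i \underset{\mathcal{V}_s}{\overset{\mathcal{G}}{\Longleftrightarrow}} j$ or $i \underset{\mathcal{V}_s}{\Longleftrightarrow} j$    All intermediate vertices on the path belong to $\mathcal{V}_s$


The following are for a digraph (directed graph) $\mathcal{G}$:


$i_0 \longrightarrow i_1 \longrightarrow \cdots \longrightarrow i_{p-1} \longrightarrow i_p$    Sequence of directed edges in $\mathcal{G}$
$(i \overset{\mathcal{G}}{\longrightarrow} j)$ or $(i \longrightarrow j)$    Directed edge from i to j in $\mathcal{G}$
$i \overset{\mathcal{G}}{\Longrightarrow} j$ or $i \Longrightarrow j$    Path between i and j in $\mathcal{G}$
$i \underset{\min}{\overset{\mathcal{G}}{\Longrightarrow}} j$ or $i \underset{\min}{\Longrightarrow} j$    All intermediate vertices on the path are less than min{i, j} (fill-path)
$i \underset{\mathcal{V}_s}{\overset{\mathcal{G}}{\Longrightarrow}} j$ or $i \underset{\mathcal{V}_s}{\Longrightarrow} j$    All intermediate vertices on the path belong to $\mathcal{V}_s$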


Specific Variables and Matrices That Are Used Throughout 


$A$, $x$, $b$    The system matrix, solution vector, and right-hand-side vector ($Ax = b$)
$A^T$    The transpose of matrix A
$D$    Diagonal matrix with 1 x 1 (and possibly 2 x 2) blocks on the diagonal
$D_A$, $L_A$, $U_A$    Diagonal and strictly lower and upper triangular parts of A
$I$ ($I_n$)    Identity matrix (of order n)
$L$, $U$    Lower and upper triangular matrices; matrix factors
$\tilde{L}$, $\tilde{U}$    Approximate matrix factors
$M$    Preconditioner
$P$, $Q$    Row and column permutation matrices
$S_r$, $S_c$    Row and column scaling matrices
$\mathcal{S}\{A\}$ ($\mathcal{S}\{v\}$)    Sparsity pattern of matrix A (vector v)
band(A)    The band of a symmetrically structured matrix A
env(A)    The envelope of a symmetrically structured matrix A
$e$    Vector of all ones
$e_i$    i-th column of the identity matrix
$f$    Filled entry in a matrix factor
$n$    Order of A
nz(A)    Number of nonzero entries in A
$ib$, $jb$, $kb$, $lb$    Subscripts denoting blocks in (e.g.) A or L
$nb$    Number of row (and column) blocks of A in block form
$\|A\|_F$    Frobenius norm of matrix A
$(x, y)_A$    A-inner product of vectors x and y, that is, $x^T A y$
$\|x\|_A$    Corresponding A-norm of vector x, that is, $(x^T A x)^{1/2}$
$\|x\|_2$    2-norm of vector x
$\kappa(C)$    Condition number of a matrix C
$\rho(C)$    Spectral radius of a matrix C
$\lambda_{\min}(C)$, $\lambda_{\max}(C)$    Eigenvalues of C of smallest and largest absolute value
$\rho_{growth}$    Growth factor
$\epsilon$    Machine precision
$\emptyset$    An empty set (one with no entries)


Abbreviations 


AINV        Factorized approximate inverse
DAG         Directed acyclic graph
FSAI        Factorized sparse approximate inverse
IC/ILU      Incomplete Cholesky/LU factorization
MIC/MILU    Modified incomplete Cholesky/LU factorization
PDE         Partial differential equation
SAINV       Stabilized factorized approximate inverse
SPAI        Sparse approximate inverse
SPD         Symmetric positive definite


Chapter 1
An Introduction to Sparse Matrices


Let us begin with a few words about the subject itself. What are 
all these research workers trying to do? Mostly, they are trying 
to solve Ax = b . . . Amazing. Can people still find something 
new to say on these corny old subjects? The answer is yes ... It 
is the pressure to solve bigger and more complex problems that 
has led people to return again and again to look in 
ever-increasing detail at such basic tools as a linear equations 
solver — Parlett (1974). 


We may therefore interpret the elimination method as ... the 
combination of two tricks: First, it decomposes A into a product 
of two [triangular] matrices ... [and second] it forms their 
inverses by a simple, explicit, inductive process — Von Neumann 
& Goldstine (1947) 


1.1 Motivation 


Consider the simple matrix A on the left in Figure 1.1. Many of its entries are zero 
(and so are omitted). This is an example of a sparse matrix. The problem we are 
interested in is that of solving linear systems of equations Ax = b, where the square 
sparse matrix A and the vector b are given and the solution vector x is required. Such 
systems arise in a huge range of practical applications, including in areas as diverse 
as quantum chemistry, computer graphics, computational fluid dynamics, power 
networks, machine learning, and optimization. The list is endless and constantly 
growing, together with the sizes of the systems. For efficiency and to enable large 
systems to be solved, the sparsity of A must be exploited and operations with the 
zero entries avoided. To achieve this, sophisticated algorithms are required. 

The majority of algorithms fall into two main categories: direct methods and iterative
methods. Direct methods transform A using a finite sequence of elementary
transformations into a product of simpler sparse matrices in such a way that solving
linear systems of equations with these factor matrices is comparatively easy and
inexpensive. For example, if A is symmetric positive definite, consider the Cholesky
factorization $A = LL^T$, where the factor L is a lower triangular matrix (and
$L^T$ denotes the transpose of L). Solving linear systems with a triangular matrix
is generally cheaper and more straightforward than for a general matrix. For the
matrix in Figure 1.1, it is clear that L has filled in, that is, compared to A, it has
more nonzero entries. If the amount of fill-in is too high, then the advantages of
having a triangular matrix will be lost. An important question is: can we permute
the rows and columns of A so as to reduce the fill-in in its factor L? One possibility
is shown in Figure 1.2. Here A has been symmetrically permuted to give a matrix
that has a much sparser factorization $LL^T$.

Figure 1.1 The locations of the nonzero entries in a sparse matrix from structural engineering
(left) and in $L + L^T$ (right), where L is its Cholesky factor.

Figure 1.2 The locations of the nonzero entries in a symmetric permutation of the matrix from
Figure 1.1 (left) and in $L + L^T$ (right), where L is the Cholesky factor of the permuted matrix.
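The effect of a symmetric permutation on fill-in can be seen on a very small example. The sketch below is not taken from the book: it uses a hypothetical 4 x 4 "arrow" matrix, a textbook dense Cholesky routine written in plain Python, and helper names (`cholesky`, `count_nonzeros`, `symmetric_permute`) chosen purely for illustration. With the dense row and column ordered first, the factor fills in completely; with it ordered last, there is no fill-in.

```python
import math

def cholesky(A):
    """Dense Cholesky factorization A = L L^T of an SPD matrix (list of lists)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

def count_nonzeros(L, tol=1e-14):
    """Number of entries of L whose magnitude exceeds tol."""
    return sum(1 for row in L for v in row if abs(v) > tol)

def symmetric_permute(A, perm):
    """Row/column i of the result is row/column perm[i] of A."""
    return [[A[p][q] for q in perm] for p in perm]

# Arrow matrix: dense first row and column, diagonal elsewhere (diagonally dominant, so SPD).
A = [[4.0, 1.0, 1.0, 1.0],
     [1.0, 4.0, 0.0, 0.0],
     [1.0, 0.0, 4.0, 0.0],
     [1.0, 0.0, 0.0, 4.0]]

nnz_original = count_nonzeros(cholesky(A))
nnz_permuted = count_nonzeros(cholesky(symmetric_permute(A, [3, 2, 1, 0])))
print(nnz_original, nnz_permuted)  # prints "10 7"
```

The lower triangle of A holds 7 nonzeros, so the reversed ordering produces a factor with no fill-in at all, while the original ordering yields a completely full lower triangle.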

Having fewer entries in L reduces both the required storage and the number of 
operations that are needed to compute it and that must be performed when using 
it. This simple example suggests other possible questions, such as: how can the 
positions of the nonzero entries in A and in its factors be described? How can the 
sparsity pattern of the factors be determined from that of A? What influences the 
computational efficiency of matrix factorizations and other matrix transformations 
on contemporary computers? 

Direct methods built on matrix factorizations are designed to be robust so 
that, properly implemented, they can be confidently used as black-box solvers for 
computing solutions with predictable accuracy. However, they can be expensive, 
requiring large amounts of memory, which increases with the size of A. By contrast, 
iterative methods compute a sequence of approximations


$x^{(0)}, x^{(1)}, x^{(2)}, \ldots$


that (hopefully) converge to the solution x of the linear system in an acceptable
number of iterations. The number of iterations depends on the initial guess $x^{(0)}$, A
and b as well as the accuracy that is wanted in x. Iterative methods use the matrix
A only indirectly, through matrix-vector products, and their memory requirements
are limited to a (small) number of vectors of length the order of A, making them
attractive for very large problems and problems where A is not available explicitly.
They can be terminated as soon as the required accuracy in the computed solution is
achieved. Unfortunately, convergence frequently fails to occur or the number of
iterations is unacceptably large; in such cases, preconditioning is needed. The aim of
preconditioning is to speed up convergence by transforming the given linear system 
into an equivalent system (or one from which it is easy to recover the solution of the 
original system) that has nicer numerical properties. For example, the transformed 
system could be 


$M^{-1}Ax = M^{-1}b,$


where the matrix M is the preconditioner and $M^{-1}$ denotes its inverse. Knowledge
of the underlying problem, such as whether or not it arises from a partial differential 
equation, can help in the construction of an effective preconditioner. Otherwise, 
purely algebraic approaches that simply take the entries of A as input may be used. 
The class of algebraic preconditioners includes those based on incomplete (or 
approximate) factorizations of A. In this case, possible questions include: can some 
of the factor entries be discarded to obtain a sparser but approximate factor that 
is useful as a preconditioner? If so, which entries can be discarded? What are the 
implications of this on the associated computational costs? 
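To make the role of M concrete, here is a minimal sketch under simplifying assumptions that are ours rather than the book's: the iterative method is the basic stationary iteration $x^{(k+1)} = x^{(k)} + M^{-1}(b - Ax^{(k)})$, the preconditioner is restricted to a diagonal matrix, and the 2 x 2 system and the function name `preconditioned_richardson` are hypothetical. With M = I the iteration diverges for this matrix, whereas with M equal to the diagonal of A it converges.

```python
def preconditioned_richardson(A, b, M_diag, tol=1e-10, max_iter=50):
    """Iterate x <- x + M^{-1} (b - A x) for a diagonal M.

    Returns (x, k) if the max-norm of the residual drops below tol after k
    updates, or (x, None) if max_iter updates were not enough.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(max_iter):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol:
            return x, k
        x = [x[i] + r[i] / M_diag[i] for i in range(n)]
    return x, None

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

# M = I: no preconditioning; the residual grows at every step for this A.
_, its_plain = preconditioned_richardson(A, b, [1.0, 1.0])

# M = diag(A): a simple algebraic preconditioner built only from the entries of A.
x, its_jacobi = preconditioned_richardson(A, b, [A[0][0], A[1][1]])
```

Practical solvers pair far more capable iterative methods with far more capable preconditioners; the point of the sketch is only that the choice of M can decide whether the iteration converges at all.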

This book uses a unified framework to address such questions for direct methods 
and algebraic preconditioners, examining both the theoretical and algorithmic 
aspects of solving large-scale linear systems of equations. 


1.2 Introductory Terminology and Concepts 


Our interest is in solving linear systems of equations 
$Ax = b,$    (1.1)


where the matrix $A \in \mathbb{R}^{n \times n}$ is nonsingular and sparse, the right-hand
side vector $b \in \mathbb{R}^n$ is given (it may be sparse or dense), and $x \in \mathbb{R}^n$ is the
required solution vector. Here n is the order (or dimension) of A and the length of x
and b. Although we focus on real A, many of the results and algorithms we present 
are valid for complex A.

Entries of A are referred to using the notation


$A = (a_{ij}), \quad 1 \le i, j \le n.$


An entry whose value is not zero (or is treated as not being equal to zero) is called a
nonzero. Column j of A is denoted by $A_{1:n,j}$ (or $A_{:,j}$) and row i by $A_{i,1:n}$ (or $A_{i,:}$).
$A_{i:j,k:l}$ denotes the $(j - i + 1) \times (l - k + 1)$ submatrix of A comprising rows i to
j, columns k to l. A is diagonal if for all $i \ne j$, $a_{ij} = 0$; it is lower triangular if
for all $i < j$, $a_{ij} = 0$; it is upper triangular if for all $i > j$, $a_{ij} = 0$. A is unit
triangular if it is triangular and all the entries on the diagonal are equal to unity.

The matrix A is structurally symmetric if for all i and j for which $a_{ij}$ is nonzero
the entry $a_{ji}$ is also nonzero. A is symmetric if


$a_{ij} = a_{ji}, \quad \text{for all } i, j.$


Otherwise, A is nonsymmetric. The symmetry index s(A) of A is defined to be
the number of nonzeros $a_{ij}$, $i \ne j$, for which $a_{ji}$ is also nonzero divided by the total
number of off-diagonal nonzeros. Small values of s(A) indicate the matrix is far
from symmetric, while values close to unity indicate an almost symmetric pattern.
A is symmetric positive definite (SPD) if it is symmetric and satisfies


$v^T A v > 0 \quad \text{for all nonzero } v \in \mathbb{R}^n.$


Otherwise, A is symmetric indefinite. An important class of symmetric indefinite
matrices are saddle point matrices of the form


$A = \begin{pmatrix} G & R^T \\ R & -B \end{pmatrix},$


where $G \in \mathbb{R}^{n_1 \times n_1}$, $B \in \mathbb{R}^{n_2 \times n_2}$, $R \in \mathbb{R}^{n_2 \times n_1}$ with $n_1 + n_2 = n$, G is an SPD
matrix, and B is a symmetric positive semidefinite matrix (that is, $v^T B v \ge 0$ for all
nonzero $v \in \mathbb{R}^{n_2}$). In some applications, B = 0.
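The symmetry index s(A) defined above depends only on the positions of the nonzeros, so it can be computed from a set of index pairs. The following is a small illustrative sketch: the function name `symmetry_index`, the 0-based indices, and the convention of returning 1.0 when there are no off-diagonal nonzeros are our assumptions, not part of the definition in the text.

```python
def symmetry_index(pattern):
    """s(A) computed from a set of (i, j) positions of the nonzeros of A."""
    off_diagonal = [(i, j) for (i, j) in pattern if i != j]
    if not off_diagonal:
        return 1.0  # assumed convention: a diagonal pattern is treated as symmetric
    matched = sum(1 for (i, j) in off_diagonal if (j, i) in pattern)
    return matched / len(off_diagonal)

# Off-diagonal nonzeros at (0,1), (1,0), (1,2); the entry (1,2) has no partner at (2,1).
pattern = {(0, 0), (0, 1), (1, 0), (1, 2), (2, 2)}
s = symmetry_index(pattern)  # two of the three off-diagonal nonzeros are matched
```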


As we will see later, it can be useful to partition the general matrix A into blocks. 
We formally express the partitioning as 


$A = (A_{ib,jb}), \quad A_{ib,jb} \in \mathbb{R}^{n_{ib} \times n_{jb}}, \quad 1 \le ib, jb \le nb,$    (1.2)

that is,


$A = \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,nb} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,nb} \\ \vdots & \vdots & \ddots & \vdots \\ A_{nb,1} & A_{nb,2} & \cdots & A_{nb,nb} \end{pmatrix}.$


We assume the square blocks $A_{jb,jb}$ on the diagonal are nonsingular. We say that
A is block diagonal if $A_{ib,jb} = 0$ for all $ib \ne jb$. A is block lower triangular if
$A_{1:jb-1,jb} = 0$, $2 \le jb \le nb$, and it is block upper triangular if $A_{jb+1:nb,jb} = 0$,
$1 \le jb \le nb - 1$.

Direct methods factorize the sparse matrix A into a product of other sparse 
matrices; what is an appropriate factorization depends on the properties of A. In 
this book, the focus is on the following variants of Gaussian elimination. 


• For symmetric positive definite A, the Cholesky factorization $A = LL^T$,
where L is a lower triangular matrix with positive diagonal entries. Observe that
this can be rewritten as $A = LDL^T$, where L is now a unit lower triangular matrix
(in general not the same matrix as the Cholesky factor) and
D is a diagonal matrix with positive diagonal entries. This is called the square
root-free Cholesky factorization. Strictly, the two factors are different matrices, but
if the context is clear, we simplify the notation and use L for the square root-free
Cholesky factor as well.

• For symmetric indefinite A, the $LDL^T$ factorization $A = LDL^T$, where L is a
unit lower triangular matrix and D is a block diagonal matrix with blocks of size
1 or 2 on the diagonal.

• For nonsymmetric A, the LU factorization $A = LU$, where L is a unit lower
triangular matrix and U is an upper triangular matrix. Gaussian elimination is
one process to put a matrix into LU form. The factorization can be rewritten as
$A = LDU$, where U is now a unit upper triangular matrix and D is a diagonal matrix.
This is called the LDU factorization.
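The relationship between the Cholesky and square root-free Cholesky factorizations in the first item can be checked numerically. In the sketch below, a dense textbook Cholesky routine (our own helper, not a library call) produces L, and the unit lower triangular factor and D are obtained by scaling each column of L by its diagonal entry; the names `root_free_from_cholesky`, `L_unit`, and `d` are hypothetical.

```python
import math

def cholesky(A):
    """Dense Cholesky factorization A = L L^T of an SPD matrix (list of lists)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def root_free_from_cholesky(L):
    """Return (L_unit, d) with L L^T = L_unit diag(d) L_unit^T and unit diagonal in L_unit."""
    n = len(L)
    d = [L[j][j] ** 2 for j in range(n)]
    L_unit = [[L[i][j] / L[j][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    return L_unit, d

A = [[4.0, 2.0], [2.0, 5.0]]
L = cholesky(A)                      # [[2, 0], [1, 2]]
L_unit, d = root_free_from_cholesky(L)

# Rebuild A from the square root-free factors as a consistency check.
n = len(A)
rebuilt = [[sum(L_unit[i][k] * d[k] * L_unit[j][k] for k in range(n)) for j in range(n)]
           for i in range(n)]
```

Here the square roots are still taken inside `cholesky`; a genuine square root-free algorithm would compute the unit factor and D directly, which this sketch does not attempt.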


As already observed, A is sparse if many of its entries are zero. Frequently, large 
matrices that arise in practical problems are sparse, and when solving large-scale 
linear systems, taking advantage of the sparsity is essential; indeed, many problems 
are intractable unless advantage is taken of sparsity to reduce the computational 
costs in terms of storage and the number of operations that must be performed. 
What proportion of the entries needs to be zero for the matrix to be considered as 
sparse is not fixed and can depend on the pattern of the entries, the operations to be 
performed, and the computer architecture. There have been attempts to formalize 
matrix sparsity more precisely. For example, a matrix of order n may be said to be 
sparse if it has O(n) nonzeros. But here we choose not to employ a formal definition. 
Instead, we say that A is sparse if it is advantageous to exploit its zero entries. 
Otherwise, A is dense. 

The sparsity pattern S{A} of A is the set of nonzeros, that is, 


$\mathcal{S}\{A\} = \{(i, j) \mid a_{ij} \ne 0,\ 1 \le i, j \le n\}.$


The number of nonzeros in A is denoted by nz(A) (or $|\mathcal{S}\{A\}|$). A is structurally (or
symbolically) singular if there are no values of the nz(A) entries of A whose row
and column indices belong to S{A} for which A is nonsingular. S{A} is symmetric
if for all i and j, $a_{ij} \ne 0$ if and only if $a_{ji} \ne 0$ (the values of the two entries need
not be the same). If S{A} is symmetric, then A is structurally symmetric.

In some situations, sparse vectors (vectors that contain many zero entries) are
considered. The sparsity pattern of a vector v of length n is given by


$\mathcal{S}\{v\} = \{i \mid v_i \ne 0,\ 1 \le i \le n\},$


and |S{v}| denotes the number of nonzeros in v. Note that here and elsewhere curly 
brackets {.} are used when working with sets to help distinguish sets from vectors. 
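As a small illustration of these definitions, the sketch below extracts S{A} and S{v} from dense data stored as Python lists and checks whether the pattern is symmetric. The helper names are ours, and the indices are 0-based, whereas the text uses 1-based indices.

```python
def sparsity_pattern(A):
    """S{A}: the set of (i, j) positions of the nonzero entries of A."""
    return {(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v != 0}

def vector_pattern(v):
    """S{v}: the set of positions of the nonzero entries of v."""
    return {i for i, vi in enumerate(v) if vi != 0}

def is_structurally_symmetric(pattern):
    """True if (j, i) is in the pattern whenever (i, j) is."""
    return all((j, i) in pattern for (i, j) in pattern)

A = [[2.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [5.0, 0.0, 4.0]]

S_A = sparsity_pattern(A)
nz_A = len(S_A)                                         # nz(A) = |S{A}|
structurally_symmetric = is_structurally_symmetric(S_A)  # values 1.0 and 5.0 differ, pattern matches
S_v = vector_pattern([0.0, 7.0, 0.0, 1.0])
```

In this example A is structurally symmetric but not symmetric, because the entries in positions (0, 2) and (2, 0) are both nonzero but have different values.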

We say that the matrix A is factorizable (or strongly regular) if its principal 
leading minors (the determinants of its principal leading submatrices) are nonzero, 
that is, if its LU factorization without row/column interchanges does not break 
down. For example, SPD matrices are factorizable. For more general A, in exact 
arithmetic, the following standard result holds. 


Theorem 1.1 (Golub & Van Loan 1996) 
If A is nonsingular, then the rows of A can be permuted so that the permuted matrix 
is factorizable. 


The row permutations do not need to be known in advance of the factorization; 
rather they can be constructed as the factorization proceeds. 
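A 2 x 2 example shows what Theorem 1.1 means in practice. The sketch below, written for illustration only, implements Gaussian elimination without interchanges and reports a breakdown when a pivot is exactly zero; the function name `lu_no_pivoting` and the example matrix are our choices, and the exact test against zero mirrors the exact-arithmetic setting of the theorem rather than what a floating-point code should do.

```python
def lu_no_pivoting(A):
    """Return (L, U) with A = L U, or None if a zero pivot shows A is not factorizable."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        if U[k][k] == 0:
            return None  # the leading principal minor of order k + 1 is zero
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

# Nonsingular (determinant -1), but the first leading principal minor is zero.
A = [[0.0, 1.0],
     [1.0, 1.0]]
breakdown = lu_no_pivoting(A)

# Swapping the two rows gives a factorizable matrix.
row_order = [1, 0]
PA = [A[p] for p in row_order]
factors = lu_no_pivoting(PA)
```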


1.2.1 Phases of a Sparse Direct Solver 


A direct method for solving the sparse system (1.1) comprises a number of 
distinct phases. The matrix A is factorized, and then, given the right-hand side
b, the factors are used to compute the solution x. There is no single direct method
that performs best on all problems and all computer architectures. Instead, many 
different algorithms have been proposed and implemented, some focussing on 
special classes of problems and/or particular architectures. However, in general, 
most approaches split the factorization into a symbolic phase (also called the 
analyse phase) and a numerical factorization phase that computes the factors. 
The symbolic phase typically uses only the sparsity pattern S{A} to compute the 
nonzero structure of the factors of A without computing the numerical values of the 
nonzeros. Following the numerical factorization, the solve phase uses the factors to 
solve for a single b or for multiple right-hand sides or for a sequence of right-hand 
sides one-by-one. 
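The separation between the factorization and solve phases can be sketched with dense data. In the illustration below, which uses our own helper names and a hypothetical 2 x 2 SPD matrix, the Cholesky factor is computed once and then reused for two right-hand sides via forward and back substitution.

```python
import math

def cholesky(A):
    """Dense Cholesky factorization A = L L^T of an SPD matrix (list of lists)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def solve_with_factor(L, b):
    """Solve L L^T x = b by forward substitution (L y = b) then back substitution (L^T x = y)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

A = [[4.0, 2.0], [2.0, 5.0]]
L = cholesky(A)  # factorization phase: performed once

# Solve phase: the same factor is reused for each right-hand side.
right_hand_sides = [[6.0, 7.0], [4.0, 2.0]]
solutions = [solve_with_factor(L, b) for b in right_hand_sides]
```

A sparse solver would also skip operations on zero entries in both substitutions, which this dense sketch makes no attempt to do.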

The fill-in in the matrix factors can render a direct method infeasible. Thus the 
symbolic phase typically incorporates finding a permutation (ordering) of the rows 
and columns of A to limit fill-in. There are many different ways to look for fill-reducing
orderings; this is discussed in Chapter 8. Once the permutation has been
selected, the symbolic phase determines the sparsity pattern of the factors of the 
permuted matrix and other key properties such as the number of entries in each row 
and column of the factors. This is achieved using the close relationships between 
matrices and graphs, which we review in Chapter 2. A symbolic factorization can 
also be used in algorithms that construct approximate factorizations by dropping
nonzeros from A and factoring the resulting sparser matrix. These approximate
factors can be employed as preconditioners for an iterative method. 
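To indicate how a symbolic phase can work from the sparsity pattern alone, here is a minimal sketch for the Cholesky case. It assumes a symmetric pattern with a nonzero diagonal and no accidental numerical cancellation, and it uses a simple column-merging scheme in which the structure of each column is passed to the column indexed by its first off-diagonal entry. The function name `symbolic_cholesky` and the example pattern are hypothetical; the efficient algorithms used in practice are considerably more sophisticated.

```python
def symbolic_cholesky(n, pattern):
    """Predict the off-diagonal pattern of the Cholesky factor from S{A} only.

    pattern holds (i, j) positions of off-diagonal nonzeros of a symmetric A
    (either triangle may be given). Returns a list cols where cols[j] is the
    set of row indices i > j with a (predicted) nonzero in column j of L.
    """
    cols = [set() for _ in range(n)]
    for i, j in pattern:
        if i != j:
            cols[min(i, j)].add(max(i, j))
    for j in range(n):
        if cols[j]:
            parent = min(cols[j])            # first off-diagonal row index in column j
            cols[parent] |= cols[j] - {parent}
    return cols

# Tridiagonal pattern plus one corner entry (4, 0), given by its lower triangle.
pattern = {(1, 0), (2, 1), (3, 2), (4, 3), (4, 0)}
cols = symbolic_cholesky(5, pattern)

# Entries of L that are not in the lower triangle of A are filled entries.
fill = {(i, j) for j in range(5) for i in cols[j]} - pattern
```

No numerical values appear anywhere in the computation, so the result can be reused for any matrix with the same pattern.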

Historically, the symbolic phase was much faster than the factorization phase, 
but considerable effort has gone into parallelizing the factorization so that the gap 
between the times for the two phases has narrowed. Indeed, the ordering part of the 
symbolic phase can dominate the total solution time. To prevent the symbolic phase 
from becoming a computational bottleneck, it needs to use efficient implementations 
of sophisticated algorithms. By setting up the data structures needed for computing 
and holding the factors, the symbolic factorization contributes to the efficiency 
of the subsequent numerical factorization in terms of time and memory. In many 
applications (for instance, when solving nonlinear equations), it is necessary to solve 
a series of problems in which the numerical values of the entries of A change but 
S{A} does not. In this case, the symbolic phase can generally be performed just 
once and its cost amortized across the numerical factorizations. 


1.2.2 Comments on the Computational Environment 


The von Neumann architecture—the fundamental architecture upon which nearly 
all digital computers have been based—involves the union of a central processing 
unit (CPU) and the memory, interconnected via input/output (I/O) mechanisms, 
as depicted in Figure 1.3. Despite being extremely simple, this sequential model 
remains useful, although nowadays the role of the CPU is undertaken by a mixture of 
powerful processors, co-processors, cores, GPUs, and so on, and current computer 
architectures employ complex memory hierarchies. Performing arithmetic operations
on the processing units is much faster than communication-based operations.
Moreover, improvements in the speed of the processing units outpace those in the 
memory-based hardware. Moore’s law is an example of an experimentally derived 
observation of this kind. 


Figure 1.3 A simple uniprocessor von Neumann computer model: a CPU and memory connected via I/O.

Two important milestones in processor development have been multiple functional
units that compute identical numerical operations in parallel and data
pipelining (also called vectorization) that enables the efficient processing of 
vectors and matrices. Vectorization is often supported by additional hardware and 
software tools (for instance, instruction pipelining) and by memory components 
such as registers and by memory architectures with multiple layers, including 
small but fast memories called caches. Superscalar processors that enable the 
overlapping of identical (or different) arithmetic operations during runtime have 
been a standard component of computers since the 1990s. The ever-increasing 
heterogeneity of processing units and their hardware environment inside computers 
has led to significant effort being invested to support code implementations, for
example, by expressing the code via units of scheduling and execution called threads.

A key objective of many numerical linear algebra algorithms is reducing time to 
solution. This is usually bound by one of the following. 


• Compute throughput, that is, the number of arithmetic operations that can be
performed per cycle.

• Memory throughput, that is, the number of operands that can be fetched from
memory/cache and/or registers each cycle.

• Latency, which is the time from initiating a compute instruction or memory
request until it is completed and the result is available for use in the next
computation.


Depending on which of these is the constraining factor, a given algorithm is said to 
be compute-bound, memory-bound, or latency-bound. Latency can often be hidden 
by performing non-dependent operations arising from a different part of a vector 
or matrix while waiting for a result, and as such is most typically a constraining 
factor for small problems or, more rarely, in the execution of complex algorithms 
on less powerful processors where resource limitation (for example, the number of 
registers) prevents such approaches. 

On modern machines, the memory throughput is normally much lower than that 
required to keep all functional units busy without significant reuse of operands, 
and this is generally true at all levels of cache. It can be useful to consider an 
algorithm’s compute intensity, that is, the ratio of the number of operations to 
the number of operands read from memory. Most chips are designed such that
dense matrix-matrix multiply, which typically performs $n^3$ operations on $n^2$ data
(with ratio k for a blocked algorithm with block size k), can run at full compute
throughput, while matrix-vector multiply performs $n^2$ operations on $n^2$ data (ratio
1) and is limited by the memory throughput. The development of basic linear algebra
subroutines (BLAS) for performing common linear algebra operations on dense 
matrices was partially motivated by obtaining a high ratio. In the late 1980s,
matrix-matrix operations (implemented by Level 3 BLAS) became a must once computers
were able to store matrix blocks with accompanying processor instructions inside
registers and fast caches. Matrix-matrix operations are able to take advantage of
the fact that data that are reused within a small amount of time or are stored in 
close memory locations (temporal and spatial locality) are processed efficiently. 
Consequently, employing Level 3 BLAS when designing and implementing matrix
algorithms (for both sparse and dense matrices) can improve performance compared 
to using Level 1 and Level 2 BLAS. 
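
As a rough illustration of compute intensity, the following sketch counts operations and operands for the two kernels discussed above. The function names and the simplified counting (one multiply-add counted as one operation, one k-by-k block fetched per block product) are assumptions made for this illustration only; real kernels and hardware are more subtle.

```python
def matvec_intensity(n: int) -> float:
    """Operations per operand for y = A x with a dense n-by-n matrix A."""
    operations = n * n      # one multiply-add per entry of A
    operands = n * n        # every entry of A is read exactly once
    return operations / operands


def blocked_matmul_intensity(n: int, k: int) -> float:
    """Operations per operand for C = A B computed with k-by-k blocks.

    Simplified model: each block product performs k^3 multiply-adds and
    fetches one k-by-k block (k^2 operands) into fast memory.
    """
    if n % k != 0:
        raise ValueError("this sketch assumes that k divides n")
    block_products = (n // k) ** 3
    operations = block_products * k ** 3   # equals n^3 in total
    operands = block_products * k * k      # k^2 operands per block product
    return operations / operands           # equals k


ratio_matvec = matvec_intensity(1000)
ratio_matmul = blocked_matmul_intensity(1024, 32)
```

Under these assumptions the matrix-vector ratio stays at 1 whatever the size, while the blocked matrix-matrix ratio grows with the block size, which is why the latter can keep the functional units busy.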

There are other important motivations behind using the BLAS. In particular, they 
facilitate software development by providing standardized codes for performing 
common vector and matrix operations that are robust, efficient, and portable. 
Machine-specific optimized BLAS libraries are available for a wide variety of 
computer architectures, and because of the importance and widespread use of the 
BLAS, new implementations are provided by computer vendors as architectures 
change. 

In this book, we discuss the design of algorithms that aim to achieve compu- 
tational efficiency through exploiting data locality and using established matrix 
block and vector operations as fundamental building blocks. We assume an idealized 
computer model, not a specific architecture or language. 


1.2.3 Finite Precision Arithmetic 


When designing numerical algorithms, it is important to consider how the numerical 
operations are performed and the effects of computational errors. Finite precision 
arithmetic underlies all computations that are performed numerically. Historically, 
computer arithmetic varied greatly between different computer manufacturers, and 
this was a source of many problems when attempting to write software that could be 
easily ported between computers. Variations were reduced significantly in 1985 with 
the development of the Institute of Electrical and Electronics Engineers (IEEE)
standard for computer floating-point arithmetic. The IEEE standard is now widely 
used, and the majority of contemporary computers represent real numbers using 
binary floating-point arithmetic that expresses real numbers as 


$$a = \pm d_1.d_2 \ldots d_t \times 2^k,$$


where k is an integer and $d_i \in \{0, 1\}$, $1 \le i \le t$, with $d_1 = 1$ unless $d_2 = d_3 =
\ldots = d_t = 0$. The number of digits t is 24 in single precision and 53 in double
precision. The exponent k lies in the range $-126 \le k \le 127$ in single precision and
$-1022 \le k \le 1023$ in double precision. Floating-point operations can be written as


$$fl(a \ \mathrm{op} \ b) = (a \ \mathrm{op} \ b)(1 + \delta), \qquad |\delta| \le \epsilon,$$


where op is a mathematical operation (such as $=, +, -, \times, /, \sqrt{\ }$) and (a op b) is the
exact result of the operation, and $\epsilon$ is the machine precision (or unit roundoff). $2 \times \epsilon$
is the smallest floating-point number that when added to the floating-point number
1.0 produces a result that is different from 1.0. For IEEE single precision arithmetic,
$\epsilon = 2^{-24} \approx 10^{-7}$ and for double precision $\epsilon = 2^{-53} \approx 10^{-16}$. Any operation
on floating-point numbers should be thought of as introducing a relative error of
absolute value at most $\epsilon$. When the results of such operations are fed into other
operations to form an algorithm, these errors propagate through the calculations. 
The two main sources of computational errors that are consequences of floating- 
point arithmetic are rounding errors and truncation errors. Certain operations can 
amplify the errors and lead to catastrophic failure when algorithms that are exact in 
conventional arithmetic are executed in floating-point arithmetic. Such algorithms 
are said to be numerically unstable; for sparse linear systems, this is discussed in 
Chapter 7. 
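
The rounding model above can be checked directly in double precision. The sketch below assumes that Python floats are IEEE double precision numbers (true on mainstream platforms); the helper name `relative_error` is introduced here for illustration only.

```python
import sys
from fractions import Fraction

EPS = 2.0 ** -53  # unit roundoff for IEEE double precision


def relative_error(a: float, b: float) -> Fraction:
    """Exact relative error delta of the floating-point sum fl(a + b)."""
    exact = Fraction(a) + Fraction(b)   # exact rational sum of the two floats
    computed = Fraction(a + b)          # the rounded result, held exactly
    return abs(computed - exact) / abs(exact)


# 2 * EPS is the gap between 1.0 and the next larger double.
gap_matches = sys.float_info.epsilon == 2 * EPS
changes_one = (1.0 + 2 * EPS) != 1.0
leaves_one = (1.0 + EPS) == 1.0         # the halfway case rounds back to 1.0

# The rounding error of a single addition is bounded by the unit roundoff.
delta = relative_error(0.1, 0.2)
within_bound = delta <= Fraction(1, 2 ** 53)
```

Exact rational arithmetic is used only to measure the error; it plays no part in the floating-point computation itself.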


1.2.4 Bit Compatibility 


For sequential solvers, achieving bit compatibility (in the sense that two runs on 
the same machine using the same binary and identical input data should produce 
identical output) is not a problem. But enforcing bit compatibility can limit dynamic 
parallelism, and when designing parallel sparse solvers, the objective of efficiency 
potentially conflicts with that of bit compatibility. Bit compatibility is essential for 
some users because of regulatory requirements (for example, within the nuclear or 
financial industries) or to build trust in their software from nontechnical users (who 
may find the non-reproducibility of results worrying or unacceptable). For others, it 
is just a desirable feature for debugging purposes. Often linear solves occur at the 
core of much more complicated codes that typically feature heuristics that can be 
sensitive to very small changes in the linear solutions found. 

The critical issue is the way in which N numbers (or, more generally, matrices) 
are assembled, that is, 


$$\mathit{sum} = \sum_{j=1}^{N} C_j,$$


where the Cj are computed using one or more processors. The assembly is 
commutative but, because of the potential rounding of the intermediate results, is 
not associative so that the result sum depends on the order in which the Cj are 
assembled. A straightforward approach to achieving bit compatibility is to enforce a 
defined order on each assembly operation, independent of the number of processors, 
but this may adversely limit the scope for parallelism. 
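
The dependence on assembly order is easy to reproduce. In the following sketch, the function names and the chunking scheme (each simulated processor sums a contiguous chunk, and the partial sums are then combined) are assumptions chosen for illustration; they are not taken from any particular solver.

```python
from typing import List, Sequence


def assemble_in_order(contributions: Sequence[float]) -> float:
    """Sum the contributions strictly in index order j = 1, ..., N."""
    total = 0.0
    for c in contributions:
        total += c
    return total


def assemble_by_chunks(contributions: Sequence[float], nproc: int) -> float:
    """Mimic nproc processors: sum contiguous chunks, then combine the partial sums."""
    n = len(contributions)
    size = -(-n // nproc)  # ceiling division
    partial: List[float] = [
        assemble_in_order(contributions[start:start + size])
        for start in range(0, n, size)
    ]
    return assemble_in_order(partial)


c = [1e17, 1.0, -1e17, 1.0]

fixed_order = assemble_in_order(c)      # (((1e17 + 1) - 1e17) + 1)
one_proc = assemble_by_chunks(c, 1)     # same order as the sequential sum
two_procs = assemble_by_chunks(c, 2)    # (1e17 + 1) + (-1e17 + 1)
```

Here the sequential sum and the two-chunk sum differ because 1.0 is lost when it is added to a number of magnitude 1e17. Always assembling in the fixed index order gives a result that is independent of the processor count, at the cost of serializing the assembly.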


1.2.5 Complexity of Algorithms 


The computational complexity of a numerical algorithm is typically based on 
estimating asymptotically the number of integer or floating-point operations or 
the memory usage. Computational complexity is expressed as a function of the 
algorithm’s input parameters (typically the problem size) and is concerned with 
how fast that function grows. Only the highest order terms are considered: scalar
factors and lower order terms are ignored. For simplicity, consider a single input
parameter. A real function y(d) of a nonnegative real d satisfies y = O(g) if there
exist positive constants c and $d_0$ such that


$$|y(d)| \le c\, g(d) \quad \text{for all } d \ge d_0.$$


O(g) bounds y asymptotically from above. As a simple illustration, consider the 
quadratic function in d 


$$y(d) = \alpha d^2 + \beta d + \gamma, \qquad \alpha \ne 0.$$


In this case, $y(d) = O(d^2)$, and the coefficient of the highest asymptotic term is $\alpha$.
In some cases, a function can also be asymptotically bounded from below. However, 
we will only use the O(.) notation because it is more important for sparse matrix 
algorithms to specify upper bounds than to discuss special cases that may imply 
lower bounds. 
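
To make the constants in the definition concrete, here is one possible choice for a general quadratic. The coefficient names $\alpha$, $\beta$, $\gamma$ are labels assumed for this illustration, and the constants found are valid but by no means the smallest possible. Since $d \le d^2$ and $1 \le d^2$ whenever $d \ge 1$:

```latex
|\alpha d^2 + \beta d + \gamma|
  \le |\alpha|\, d^2 + |\beta|\, d + |\gamma|
  \le \bigl(|\alpha| + |\beta| + |\gamma|\bigr)\, d^2
  \qquad \text{for all } d \ge 1 .
```

The definition is therefore satisfied with $g(d) = d^2$, $c = |\alpha| + |\beta| + |\gamma|$ and $d_0 = 1$.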

Computational complexity can estimate quantities related to the worst-case 
behaviour of an algorithm or its average behaviour. When considering complexity 
based on operation counts, as a result of using a unit-cost random-access computer 
model, it is common to assume the operations have a unit cost. But in practice 
there can be a significant difference between the cost of operations, such as addition 
and subtraction, and operations with integer operands or operations using different 
precisions. Division and square root operations can be significantly more expensive 
than multiply/add operations; the difference is highly dependent on the computing 
platform. Thus, unit cost can be a significant simplification, and counting floating- 
point operations is arguably of limited value in assessing the performance of 
different algorithms on modern computers. Nevertheless, sparse matrix algorithms
that are $O(n^2)$ are considered to be computationally too expensive: the goal when
designing algorithms is that they should be linear (or close to linear) in the input,
that is, linear in n or nz(A). Linear complexity is often achieved in the symbolic
phase of a sparse direct solver, but the complexity of the numerical factorization 
phase is typically higher and may determine the size of the linear systems that can be 
solved using a sparse direct method. However, for modern computer architectures, 
the number of floating-point operations is not necessarily a good indicator of 
the time required to solve the linear system. Indeed, parallel implementations of 
algorithms that perform more operations than the minimum needed can lead to 
reductions in the runtime because costly data movements and synchronizations can 
be limited by, for example, duplicating operations on multiple processors. 

As computers have become more powerful (in terms of both the computational 
speed and the available memory), the size of the linear systems that can be solved 
using a (parallel) dense method that ignores sparsity in A has steadily increased; 
nowadays linear systems with n of the order $10^6$ can potentially be tackled using
a dense solver (although if A is sparse, the operation count and solution time will 
generally be greatly reduced by using algorithms that limit operations on zeros). 
Many practical applications lead to systems where A is sparse and n is significantly
larger than this. The size of systems that can be solved using a sparse direct method 
has also steadily increased over the years, and the algorithms they use have become 
ever more sophisticated so that it is commonplace to solve systems of order greater 
than $10^7$. But the complexity does limit the problem size, and for very large systems,
an iterative solver is often the only option. 

In computer science, complexity theory introduces additional concepts and 
distinguishes between problems for which algorithms of polynomial complexity 
exist and those where a hypothesis is that only algorithms of super polynomial 
complexity exist. Without going into detail, we refer to problems in this latter class 
as being combinatorially hard. 


1.3 Sparse Matrices and Their Representation in a Computer


To implement sparse matrix algorithms on a computer requires special data 
structures and storage schemes that allow matrices and vectors to be stored, 
retrieved, manipulated, and updated. There are many ways to do this; key to them all 
is that they must be compact and avoid storing and manipulating numerically zero 
entries. 


1.3.1 Sparse Vector Storage 


A sparse vector can be stored using a real array for the nonzero values together 
with an integer array containing the indices of these entries, as demonstrated by the 
following example. 


Example 1.1 Let v be the sparse row vector

v = (1.  -2.  0.  -3.  0.  5.  3.  0.).    (1.3)


The real array valV that stores the nonzero values and corresponding integer array
of their indices indV is of length |S{v}| = 5 and is as follows:


Subscripts    1     2     3     4     5
valV         1.   -2.   -3.    5.    3.
indV          1     2     4     6     7
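
The packed storage of Example 1.1 can be sketched as follows. The helper names `compress` and `expand` are hypothetical; the indices are kept 1-based to match the text, while Python lists themselves are 0-based.

```python
from typing import List, Tuple


def compress(v: List[float]) -> Tuple[List[float], List[int]]:
    """Return (valV, indV): the nonzero values of v and their 1-based indices."""
    valV = [x for x in v if x != 0.0]
    indV = [i for i, x in enumerate(v, start=1) if x != 0.0]
    return valV, indV


def expand(valV: List[float], indV: List[int], n: int) -> List[float]:
    """Rebuild the full vector of length n from its packed form."""
    v = [0.0] * n
    for x, i in zip(valV, indV):
        v[i - 1] = x            # shift the 1-based index to Python's 0-based one
    return v


v = [1.0, -2.0, 0.0, -3.0, 0.0, 5.0, 3.0, 0.0]   # the vector (1.3)
valV, indV = compress(v)
roundtrip = expand(valV, indV, len(v))
```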


Alternatively, a linked list can be used. While modern programming languages 
often support linked lists directly as an abstract data structure, in sparse matrix 
algorithms it is usual to implement them explicitly using arrays together with an 
integer that points to the first entry (the header pointer). Each entry is associated 
with a link that points to the next entry or is null if the entry is the last in the list.
The links can be adjusted so that the values are scanned in a different order without 
moving the physical locations. Storing the vector (1.3) as a linked list is illustrated 
in Example 1.2. Here v is stored in two different ways, emphasizing that the order 
of the entries is determined by the links, not by the physical locations of the entries. 


Example 1.2 Two possible ways of storing the sparse vector (1.3) using linked lists. 


Subscripts    1     2     3     4     5        Subscripts    1     2     3     4     5
Values       1.   -2.   -3.    5.    3.        Values       5.    3.    1.   -2.   -3.
Indices       1     2     4     6     7        Indices       6     7     1     2     4
Links         2     3     4     5     0        Links         2     0     4     5     1
Header        1                                Header        3


There are two important reasons for using linked lists. Firstly, it is straightforward 
to add extra entries, and secondly, entries can be removed without any data 
movement. This is illustrated in Example 1.3. Linked lists are an example of a 
dynamic structure. 


Example 1.3 On the left, an entry -4 has been added to the sparse vector (1.3) in
position 5, and, on the right, the entry -2 in position 2 has been removed. x indicates
the entry is not accessed. On the left, link(5) has changed and link(6) is new; on the
right, link(1) has changed.


Subscripts    1     2     3     4     5     6        Subscripts    1     2     3     4     5
Values       1.   -2.   -3.    5.    3.   -4.        Values       1.     x   -3.    5.    3.
Indices       1     2     4     6     7     5        Indices       1     x     4     6     7
Links         2     3     4     5     6     0        Links         3     x     4     5     0
Header        1                                      Header        1
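
The operations in Examples 1.2 and 1.3 can be sketched with array-based linked lists as follows. The function names are hypothetical, position 0 of each array is unused padding so that subscripts match the 1-based tables, and a link of 0 plays the role of the null link.

```python
from typing import List, Tuple


def traverse(values, indices, links, header) -> List[Tuple[int, float]]:
    """Follow the links from the header; return (index, value) pairs in list order."""
    out = []
    p = header
    while p != 0:
        out.append((indices[p], values[p]))
        p = links[p]
    return out


def append_entry(values, indices, links, header, index, value) -> int:
    """Store a new entry in the next free location and link it after the last entry."""
    values.append(value)
    indices.append(index)
    links.append(0)
    new = len(values) - 1
    if header == 0:
        return new                  # the list was empty
    p = header
    while links[p] != 0:
        p = links[p]
    links[p] = new
    return header


def remove_entry(indices, links, header, index) -> int:
    """Unlink the entry with the given index; no stored data is moved."""
    prev, p = 0, header
    while p != 0 and indices[p] != index:
        prev, p = p, links[p]
    if p == 0:
        return header               # not found: nothing to do
    if prev == 0:
        return links[p]             # the first entry was removed
    links[prev] = links[p]
    return header


# Left-hand storage of Example 1.2.
values = [None, 1.0, -2.0, -3.0, 5.0, 3.0]
indices = [None, 1, 2, 4, 6, 7]
links = [None, 2, 3, 4, 5, 0]
header = 1

# Example 1.3, left: add the entry -4 with index 5.
v_add, i_add, l_add = values[:], indices[:], links[:]
h_add = append_entry(v_add, i_add, l_add, header, index=5, value=-4.0)

# Example 1.3, right: remove the entry with index 2.
l_del = links[:]
h_del = remove_entry(indices, l_del, header, index=2)
remaining = traverse(values, indices, l_del, h_del)
```

After the removal, location 2 still holds its old data but is no longer reachable from the header, which corresponds to the entries marked x in Example 1.3.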


1.3.2 Sparse Matrix Storage 


The vector data structures can be generalized to sparse matrices. The simplest way to 
store a sparse matrix is using coordinate (or triplet) format. The individual entries 
of A are held as triplets $(i, j, a_{ij})$, where i is the row index and j is the column index
of the entry $a_{ij} \ne 0$. Three arrays (one real and two integer) each of length nz(A)
are needed. Although this form is easy to create, it is not efficient for manipulating 
sparse matrices (for example, just adding two sparse matrices with different sparsity 
structures presents difficulties). 

The CSR (Compressed Sparse Row) format is widely used. The column indices 
of the entries of A are held by rows in an integer array (which we will call 
colindA) of length nz(A), with those in row 1 followed by those in row 2, and 
so on (with no space between rows). Often, within each row, the entries are held by 
increasing column index. A real array valA of the same length holds the values of
the corresponding entries of A in the same order. A third array rowptrA of length
n + 1 is such that its i-th entry points to the position of the start of row i ($1 \le i \le n$)
of A within colindA and valA, and rowptrA(n + 1) is set to nz(A) + 1.

CSC (Compressed Sparse Columns) format is defined analogously by holding 
the entries by columns, rather than by rows. If A is symmetric, only the lower (or 
upper) triangular part is generally stored. If the matrix values are not stored, the 
arrays rowptrA and colindA represent the graph G(A), which we discuss in the
next chapter. 


Example 1.4 Let A be the sparse matrix

$$
A = \begin{pmatrix}
3. & & & -2. & \\
& 1. & & & 4. \\
-1. & & 3. & & 1. \\
& & & 1. & \\
& 7. & & & 6.
\end{pmatrix}. \qquad (1.4)
$$


Coordinate format represents A as follows. Note that the entries are in no 
particular order. 


Subscripts    1     2     3     4     5     6     7     8     9    10
rowindA       3     2     3     4     1     1     2     5     3     5
colindA       3     2     1     4     4     1     5     5     5     2
valA         3.    1.   -1.    1.   -2.    3.    4.    6.    1.    7.


CSR format represents A as follows. Here the entries within each row are in order 
of increasing column index. This additional condition is often but not always used. 


Subscripts    1     2     3     4     5     6     7     8     9    10
rowptrA       1     3     5     8     9    11
colindA       1     4     2     5     1     3     5     4     2     5
valA         3.   -2.    1.    4.   -1.    3.    1.    1.    7.    6.
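
The conversion from coordinate format to CSR format, and the row-wise access that CSR supports, can be sketched as follows. The function names are hypothetical, all stored indices and pointers are 1-based as in the text, and the sort used here is a simple choice for illustration rather than the linear-time approach a production code would use. The input triplets are those of Example 1.4.

```python
from typing import List, Tuple


def coo_to_csr(n: int, rowind: List[int], colind: List[int], val: List[float]
               ) -> Tuple[List[int], List[int], List[float]]:
    """Convert coordinate triplets to CSR arrays, ordered by column within each row."""
    order = sorted(range(len(val)), key=lambda t: (rowind[t], colind[t]))
    colindA = [colind[t] for t in order]
    valA = [val[t] for t in order]
    counts = [0] * n
    for i in rowind:
        counts[i - 1] += 1              # number of entries in row i
    rowptrA = [1]
    for c in counts:
        rowptrA.append(rowptrA[-1] + c)  # the last entry becomes nz(A) + 1
    return rowptrA, colindA, valA


def csr_matvec(rowptrA: List[int], colindA: List[int], valA: List[float],
               x: List[float]) -> List[float]:
    """Compute y = A x by scanning each compressed row in turn."""
    n = len(rowptrA) - 1
    y = [0.0] * n
    for i in range(n):
        # Shift the 1-based pointers to Python's 0-based positions.
        for p in range(rowptrA[i] - 1, rowptrA[i + 1] - 1):
            y[i] += valA[p] * x[colindA[p] - 1]
    return y


rowind = [3, 2, 3, 4, 1, 1, 2, 5, 3, 5]
colind = [3, 2, 1, 4, 4, 1, 5, 5, 5, 2]
val = [3.0, 1.0, -1.0, 1.0, -2.0, 3.0, 4.0, 6.0, 1.0, 7.0]

rowptrA, colindA, valA = coo_to_csr(5, rowind, colind, val)
row_sums = csr_matvec(rowptrA, colindA, valA, [1.0] * 5)
```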


The CSR and CSC formats are static data structures. While reading A is 
straightforward, it can be difficult to make modifications, for instance, adding a 
new entry at a specified location. Removing an entry is also problematic. The value 
of the entry could be set to zero, but if a significant number of entries are set to 
zero, this may not be efficient because, when A is used, operations are performed 
on zeros and more memory than is necessary is used. Adding and deleting entries 
are possible if the sparse rows or columns are stored using linked lists. 


Example 1.5 The matrix in (1.4) can be held as a collection of columns, each in a 
linked list, as follows. Here the array colA_head holds header pointers, with the 
i-th entry pointing to the location of the first entry in column i. 

Subscripts    1     2     3     4     5     6     7     8     9    10
rowindA       3     2     3     4     1     1     2     5     3     5
valA         3.    1.   -1.    1.   -2.    3.    4.    6.    1.    7.
link          0    10     0     0     4     3     9     0     8     0
colA_head     6     2     1     5     7


For column 4, colA_head(4) = 5, rowindA(5) = 1 and valA(5) = -2, so the
first entry in column 4 is $a_{14} = -2$. Next, link(5) = 4, rowindA(4) = 4, and
valA(4) = 1, so the second entry in column 4 is $a_{44} = 1$. Because link(4) = 0,
there are no more entries in the column. If we want to add an entry to the (3, 4)
position while retaining the order of the entries within column 4, then we do this by
setting valA(11) to hold the new entry, and rowindA(11) = 3, link(5) = 11,
and link(11) = 4 (the original value of link(5)). The resulting link array is
shown below; the entries that have changed are link(5) and link(11).


Subscripts    1     2     3     4     5     6     7     8     9    10    11
link          0    10     0     0    11     3     9     0     8     0     4
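
The column traversal and the ordered insertion described in Example 1.5 can be sketched as follows. The function names are hypothetical, position 0 of each array is unused padding for 1-based subscripts, and the value 9.0 given to the new (3, 4) entry is an arbitrary placeholder because the text does not specify one.

```python
from typing import List, Tuple


def column_entries(col, colA_head, rowindA, valA, link) -> List[Tuple[int, float]]:
    """Return the (row, value) pairs of a column in linked-list order."""
    out = []
    p = colA_head[col]
    while p != 0:
        out.append((rowindA[p], valA[p]))
        p = link[p]
    return out


def insert_sorted(col, row, value, colA_head, rowindA, valA, link) -> None:
    """Insert entry (row, col), keeping the rows of the column in increasing order."""
    rowindA.append(row)
    valA.append(value)
    link.append(0)
    new = len(valA) - 1             # the new entry occupies the next free location
    prev, p = 0, colA_head[col]
    while p != 0 and rowindA[p] < row:
        prev, p = p, link[p]
    link[new] = p                   # the new entry points to its successor
    if prev == 0:
        colA_head[col] = new        # the new entry is now first in the column
    else:
        link[prev] = new


rowindA = [None, 3, 2, 3, 4, 1, 1, 2, 5, 3, 5]
valA = [None, 3.0, 1.0, -1.0, 1.0, -2.0, 3.0, 4.0, 6.0, 1.0, 7.0]
link = [None, 0, 10, 0, 0, 4, 3, 9, 0, 8, 0]
colA_head = [None, 6, 2, 1, 5, 7]

before = column_entries(4, colA_head, rowindA, valA, link)
insert_sorted(4, 3, 9.0, colA_head, rowindA, valA, link)   # 9.0 is a placeholder value
after = column_entries(4, colA_head, rowindA, valA, link)
```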


A disadvantage of linked list storage is that it prohibits the fast access to rows 
(or columns) of the matrix that is needed for efficient processing on contemporary 
computers that use vectorization and/or work with matrix blocks. Consequently, 
CSR or CSC formats are commonly used in sparse direct methods. 

Static data structures are efficient for sparse matrix factorizations if the sparsity 
structures of the factors are known before the factorization begins. However, it is 
often the case that new nonzero entries need to be added and/or others need to be 
removed, and it is not necessarily possible to predict the required space in advance. 
A storage scheme that has some space to embed new nonzeros is the DS (Dynamic 
Sparse) format. It stores the nonzeros of both the rows and columns of A in real 
arrays valAR and valAC, with the corresponding row and column indices held
in integer arrays rowindA and colindA. Pointers to the start of each row and
column are stored in the integer arrays rowptrA and colptrA, as in the CSR and
CSC formats. In addition, the lengths of the compressed rows and columns (which 
are called row and column segments) are stored separately. In some situations, it 
can be sufficient to hold only the row (or the column) information (DSR and DSC 
formats). The following example illustrates the DS format. 


Example 1.6 Consider again the matrix given by (1.4). The DS format represents A 
using two sets of arrays. The first four store the matrix by rows, and the second 
four store it by columns. The entries are in no particular order in both sets of 
arrays. The arrays rlength and clength hold the numbers of entries in the rows 
and columns, respectively. Free space between segments can be used to store new 
nonzero entries, and it is this that makes the storage scheme efficient, provided the 
number of changes to the matrix structure during the factorization is limited. 

Subscripts    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
rowptrA       1    5    8   12   14
colindA       1    4              2    5         1    3    5         4         2    5
valAR        3.  -2.             1.   4.       -1.   3.   1.        1.        7.   6.
rlength       2    2    3    1    2

colptrA       1    6    9   12
rowindA       1    3    2    5    3    1    4    2    3    5
valAC        3.  -1.   1.   7.   3.  -2.   1.   4.   1.   6.
clength       2    2    1    2    3
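
The way free space is used can be sketched for the row part of the DS format (that is, the DSR format) as follows. The layout assumes, as implied by rowptrA and rlength in Example 1.6, that positions 3, 4, 7, 11 and 13 are free. The function name `dsr_insert`, the use of 0 to mark a free slot, and the inserted value 8.0 are assumptions made for this sketch. A full DS implementation would also have to update the column arrays.

```python
def dsr_insert(i, j, value, rowptrA, rlength, colindA, valAR) -> bool:
    """Append entry (i, j) to row segment i if a free slot follows it.

    Returns False when the segment is full, in which case a real code
    would have to move the row to a larger free area.
    """
    start = rowptrA[i]
    end = start + rlength[i]                 # first position after the segment
    if i + 1 < len(rowptrA):
        limit = rowptrA[i + 1]               # start of the next row segment
    else:
        limit = len(colindA)                 # one past the last storage position
    if end >= limit:
        return False
    colindA[end] = j
    valAR[end] = value
    rlength[i] += 1
    return True


# Position 0 is unused padding so that subscripts are 1-based; 0 marks a free slot.
rowptrA = [None, 1, 5, 8, 12, 14]
rlength = [None, 2, 2, 3, 1, 2]
colindA = [None, 1, 4, 0, 0, 2, 5, 0, 1, 3, 5, 0, 4, 0, 2, 5]
valAR = [None, 3.0, -2.0, 0.0, 0.0, 1.0, 4.0, 0.0, -1.0, 3.0, 1.0, 0.0,
         1.0, 0.0, 7.0, 6.0]

added_row1 = dsr_insert(1, 2, 8.0, rowptrA, rlength, colindA, valAR)  # uses position 3
added_row5 = dsr_insert(5, 1, 8.0, rowptrA, rlength, colindA, valAR)  # row 5 has no free slot
```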


Blocked formats may be used to accelerate multiplication between a sparse 
matrix and a dense vector. Iterative methods typically require that the same sparse 
matrix is multiplied by vectors many times before a solution is found. The matrix 
can be put into a block storage format once, and then the cost of finding the blocks 
and converting the matrix format can be offset by the savings that result from 
repeatedly multiplying the matrix. The Variable Block Row (VBR) format groups 
together similar adjacent rows and columns. The numbers of such rows and columns 
can be different in each dimension, resulting in variable sized blocks. For a large 
sparse block-structured matrix, using a VBR format potentially reduces the amount 
of integer storage, and the block representation enables numerical algorithms to 
perform the kernel matrix operations more efficiently on the block entries. However, 
only heuristic algorithms are available for determining the groupings of the rows and 
columns. 

The data structure of the VBR format uses six arrays. Integer arrays rptr and 
cptr hold the index of the first row in each block row and the index of the first 
column in each block column, respectively. In many cases, the block row and 
column partitionings are conformal, and only one of these arrays is needed. The 
real array valA contains the entries of the matrix block-by-block in column-major 
order. The integer array indx holds pointers to the beginning of each block entry 
within valA. The index array bindx holds the block column indices of the block 
entries of the matrix, and finally, the integer array bptr holds pointers to the start 
of each row block in bindx. 
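
To show how the six arrays work together, the following sketch multiplies a VBR matrix by a vector. The small 3 x 3 matrix used here is a hypothetical example and is not the matrix of the book's own example. The sketch also assumes that rptr, cptr, indx and bptr each carry a final entry pointing one past the end, and that all pointers and indices are 1-based.

```python
from typing import List


def vbr_matvec(rptr: List[int], cptr: List[int], valA: List[float],
               indx: List[int], bindx: List[int], bptr: List[int],
               x: List[float]) -> List[float]:
    """Compute y = A x for a matrix held in VBR format (dense column-major blocks)."""
    y = [0.0] * (rptr[-1] - 1)
    for ib in range(len(rptr) - 1):                 # loop over block rows
        r0 = rptr[ib] - 1                           # first row of the block row (0-based)
        nr = rptr[ib + 1] - rptr[ib]                # number of rows in the block row
        for b in range(bptr[ib] - 1, bptr[ib + 1] - 1):
            jb = bindx[b] - 1                       # block column of this block
            c0 = cptr[jb] - 1                       # first column of the block (0-based)
            nc = cptr[jb + 1] - cptr[jb]            # number of columns in the block
            base = indx[b] - 1                      # start of the block within valA
            for c in range(nc):
                for r in range(nr):
                    y[r0 + r] += valA[base + c * nr + r] * x[c0 + c]
    return y


# Hypothetical matrix with row blocks 1:2, 3 and column blocks 1:2, 3:
#     ( 1.  2.  6. )
#     ( 3.  4.  7. )
#     (         5. )
rptr = [1, 3, 4]
cptr = [1, 3, 4]
valA = [1.0, 3.0, 2.0, 4.0, 6.0, 7.0, 5.0]   # three blocks, each in column-major order
indx = [1, 5, 7, 8]
bindx = [1, 2, 2]
bptr = [1, 3, 4]

y = vbr_matvec(rptr, cptr, valA, indx, bindx, bptr, [1.0, 2.0, 3.0])
```

The two innermost loops run over a dense block, which is the part that a real implementation would hand to an optimized dense kernel.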


Example 1.7 Let A be the sparse matrix 


[8 × 8 sparse matrix with nonzero entries 1., 2., ..., 22.; its block structure is given below.]

Here the row blocks comprise rows 1:2, 3, 4:6, and 7:8. The column blocks
comprise columns 1:2, 3:5, 6, 7:8. The VBR format stores A as follows. 


Subscripts  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
rptr        1 3 4 7 9
cptr        1 3 6 7 9
valA        1. 4. 2. 5. 3. 6. 7. 8. 9. 10. 11. 14. 12. 13. 15. 17. 16. 18. 19. 21. 22. 20.
indx        1 5 7 10 11 15 19
bindx       1 3 2 3 1 4 2
bptr        1 3 5 7


1.4 Notes and References 


There are some excellent textbooks that provide in-depth coverage of numerical 
linear algebra for dense matrices (such as Golub & Van Loan, 1996; Demmel, 1997; 
Trefethen & Bau, 1997, and Strang, 2007). Although sparse direct methods have 
been a constant subject for research since the 1960s and despite their importance and 
widespread use, there has only ever been a handful of books focusing on them. The 
most recent are Davis (2006) and Duff et al. (2017), but see also Tewarson (1973), 
George & Liu (1981), Pissanetzky (1984), and Zlatev (1991). In addition, Meurant 
(1999) covers both direct and iterative methods. The books by Björck (1996, 2015) 
and Wendland (2017) are also relevant. 

We focus on factorizations based on Gaussian elimination, but another important 
class of direct methods are those based on orthogonal factorizations, most notably 
QR factorizations of the form A = QR, where Q is an orthogonal matrix and R is 
an upper triangular matrix. These methods are generally more expensive than those 
that use LU factorizations (in terms of operation counts, the density of the factors, 
and the time required to solve the linear system), but they can offer advantages in 
terms of numerical stability. We refer the reader to the book by Davis (2006) for a 
study of such approaches. 

Over the last fifty years, in addition to the huge quantity of journal articles 
relating to specific aspects of sparse direct methods, a number of useful survey 
and overview papers have been published. These not only summarize important 
aspects of sparse direct methods but provide interesting historical perspectives on 
the theoretical, algorithmic, and software developments in the field. Early surveys 
include Tewarson (1970), Reid (1974), Duff (1977, 1981), while the comprehensive 
survey of Demmel et al. (1993) sums up early developments in parallel sparse direct 
solvers. Gould et al. (2007) look specifically at software that implements sparse 
direct methods, while the excellent survey of Davis et al. (2016) includes many 
further references to review papers and early conference proceedings where some of 
the key ideas related to sparse direct methods were first introduced. A short overview 
of modern sparse elimination methods is given by Bollhöfer et al. (2020).

A wide range of books devoted to iterative methods for solving large-scale
linear systems have been written, for example, Axelsson (1994), Greenbaum (1997), 
Saad (2003b), van der Vorst (2003), Olshanskii & Tyrtyshnikov (2014), Meurant & 
Duintjer Tebbens (2020), Bai & Pan (2021), and Ciaramella & Gander (2022). 

There are many references to contemporary computational environments. To 
understand the basic principles and connection of computations with basic linear 
algebra subroutines (BLAS), a good starting point is Dongarra et al. (1998), while 
contributions in van der Vorst & Van Dooren (2015) provide a general resource on 
parallel computation in numerical linear algebra. Specific features of finite precision 
arithmetic in this field are clearly and thoroughly explained in Higham (2002). 
For the complexity of algorithms as well as for much of the terminology related 
to the sparse data structures used in this book, we refer to Tarjan (1983); we also 
recommend Cormen et al. (2009) or Skiena (2020). 

Texts providing details of the storage formats that are primarily for sparse 
direct methods include Pissanetzky (1984), Østerby & Zlatev (1983) (this discusses, 
in particular, dynamic data structures; see also the technical report of Duff, 
1980). Storage schemes used in connection to preconditioned iterative methods are 
considered in Saad (2003b). VBR and other sparse storage formats are described, 
for example, in the SPARSKIT library documentation of Saad (1994b). Buluç et al. 
(2011) provide a good review and evaluation of storage formats for sparse matrices 
and their impact on primitive operations. 


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 2
Sparse Matrices and Their Graphs


The choice of data structure is one of the most important steps in
algorithm design and implementation. Sparse matrix algorithms
are no exception. The representation of a sparse matrix not only
determines the efficiency of the algorithm, but also influences
the algorithm design process (Buluç et al., 2011).


Every sparse matrix problem is a graph problem and every
graph problem is a sparse matrix problem (Gilbert et al., 2006).


Many sparse matrix algorithms exploit the close relationship between matrices and 
graphs. We make no assumption regarding the reader’s prior knowledge of graph 
theory. The purpose of this chapter is to summarize basic concepts from graph 
theory that will be exploited later and to establish the notation and terminology 
that will be used throughout. 


2.1 Introduction to Graphs 


A graph G = (V, E) is a finite set V of vertices (or nodes), and a set E of edges
defined as pairs of distinct vertices. When there is no distinction between the pairs
of vertices (u, v) and (v, u), the edges are represented by unordered pairs, and the
graph is undirected. If, however, the pairs are ordered, the graph is a directed
graph, or a digraph. Examples of simple graphs are given in Figures 2.1 and 2.2.
A labelling (or ordering) of a graph G = (V, E) with n vertices is a bijection of
{1, 2, ..., n} onto V. The integer i ($1 \le i \le n$) assigned to a vertex in V is called
the label (or simply the number) of that vertex. Our standard choice of vertices will
be V = {1, ..., n} so that the vertices are directly identified by their labels.


$G_s = (V_s, E_s)$ is a subgraph of G = (V, E) if and only if $V_s \subseteq V$ and $E_s \subseteq E$
and $(u_s, v_s) \in E_s$ implies $u_s, v_s \in V_s$. The subgraph is an induced subgraph if $E_s$
contains all the edges in E that have both u and v in $V_s$. Two graphs G = (V, E)
and $G_s = (V_s, E_s)$ are isomorphic if there is a bijection $g : V \to V_s$ that preserves
adjacency, that is, $(u, v) \in E$ if and only if $(g(u), g(v)) \in E_s$.


Figure 2.1 An example of an undirected graph.


Figure 2.2 An example of a directed graph (digraph). The arrows indicate the direction of an
edge. There is an edge (4 → 5) and an edge (5 → 4).

In an undirected graph, two vertices u and v in V are said to be adjacent (or
neighbours) if $e = (u, v) \in E$; the edge e is incident to the vertex u and to the
vertex v. We also use the notation $(u \longleftrightarrow v)$ for an edge (or $(u \overset{G}{\longleftrightarrow} v)$ to
emphasize the edge belongs to the graph G). The degree $\deg_G(u)$ of $u \in V$ is the
number of vertices in V that are adjacent to u, and the adjacency set $\mathrm{adj}_G\{u\}$ is
the set of these adjacent vertices (thus $|\mathrm{adj}_G\{u\}| = \deg_G(u)$). If $V_s$ is a subset of
the vertices, then the adjacency set $\mathrm{adj}_G\{V_s\}$ is the set of vertices in $V \setminus V_s$ that
are adjacent to at least one vertex in $V_s$. A subgraph is a clique when every pair
of its vertices is adjacent. In the example in Figure 2.1, $\deg_G(2) = 4$ and $\mathrm{adj}_G\{2\} =
\{1, 3, 4, 6\}$. The induced subgraph with vertices $V_s = \{2, 4, 6\}$ is a clique.
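
These definitions translate directly into set operations. In the sketch below, the function names are hypothetical and the edge list is an assumed small graph chosen only to be consistent with the facts quoted in the text for vertex 2 and the clique {2, 4, 6}; it is not read from Figure 2.1.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def adjacency_sets(n: int, edges: List[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Build the adjacency set of every vertex of an undirected graph on {1, ..., n}."""
    adj: Dict[int, Set[int]] = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def adj_of_subset(adj: Dict[int, Set[int]], Vs: Iterable[int]) -> Set[int]:
    """Vertices outside Vs that are adjacent to at least one vertex in Vs."""
    Vs = set(Vs)
    out: Set[int] = set()
    for u in Vs:
        out |= adj[u]
    return out - Vs


def is_clique(adj: Dict[int, Set[int]], Vs: Iterable[int]) -> bool:
    """True if every pair of vertices in Vs is adjacent."""
    return all(v in adj[u] for u, v in combinations(sorted(Vs), 2))


# Assumed edges (hypothetical graph, not taken from the figure).
edges = [(1, 2), (2, 3), (2, 4), (2, 6), (4, 6), (4, 5)]
adj = adjacency_sets(6, edges)

degree_2 = len(adj[2])
clique_246 = is_clique(adj, {2, 4, 6})
boundary_246 = adj_of_subset(adj, {2, 4, 6})
```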


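In a digraph, we use the notation $(u \longrightarrow v)$ or $(u \overset{G}{\longrightarrow} v)$ for a directed edge.
There can be an edge $(u \longrightarrow v)$ but no edge $(v \longrightarrow u)$. The adjacency set of u can be
split into two parts


$$\mathrm{adj}_G^+\{u\} = \{v \mid (u \longrightarrow v) \in E\} \quad \text{and} \quad \mathrm{adj}_G^-\{u\} = \{v \mid (v \longrightarrow u) \in E\}.$$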


In the example given in Figure 2.2, adjg {2} = {3, 4} and adjg {2} =; 


2.2 Walks, Paths, Cycles, and DAGs


A sequence of k edges in an undirected graph G


$$u_0 \longleftrightarrow u_1 \longleftrightarrow \ldots \longleftrightarrow u_{k-1} \longleftrightarrow u_k$$


is called a walk of length k. If G is a digraph, then the sequence


$$u_0 \longrightarrow u_1 \longrightarrow \ldots \longrightarrow u_{k-1} \longrightarrow u_k$$


is a directed walk. The vertices $u_0$ and $u_k$ are connected by the walk, and for k > 0,
$u_k$ is said to be reachable from $u_0$; the set of vertices that are reachable from $u_0$
is denoted by Reach($u_0$). The walk is closed if $u_0 = u_k$; a closed walk is called a
cycle. Graphs that do not contain cycles are acyclic. A (directed) trail is a (directed)
walk in which all the edges are distinct and a (directed) path is a (directed) trail in
which all the vertices (and therefore also all the edges) are distinct. The distance
between two vertices is the number of edges in the shortest path connecting them
(this is also called the length of the path). In Figure 2.2, there is a path of length 4
from vertex 1 to vertex 7 but no path from vertex 7 to vertex 1.

In the undirected graph $G = (V, E)$, a path between a pair of its vertices with
labels $i$ and $j$ is denoted by

\[ i \overset{G}{\Longleftrightarrow} j \]

or, if it is clear which graph the path is in, by

\[ i \Longleftrightarrow j. \]

If all intermediate vertices on the path are less than $\min\{i, j\}$, then the path is called
a fill-path and is denoted by

\[ i \overset{G}{\underset{\min}{\Longleftrightarrow}} j \quad\text{or}\quad i \underset{\min}{\Longleftrightarrow} j. \]

If all intermediate vertices on the path belong to a subset $V_s$, then the path is denoted
by

\[ i \overset{G}{\underset{V_s}{\Longleftrightarrow}} j \quad\text{or}\quad i \underset{V_s}{\Longleftrightarrow} j. \]

If $G$ is a digraph, the double-sided arrow symbols are replaced by one-sided ones
$\Longrightarrow$ in the direction of the edges. For example,

\[ i \overset{G}{\Longrightarrow} j, \quad i \Longrightarrow j, \quad i \underset{\min}{\Longrightarrow} j \quad\text{and}\quad i \underset{V_s}{\Longrightarrow} j. \]

Figure 2.3 An example of a DAG with two different topological orderings (see Section 4.4).

Figure 2.4 An example of an undirected graph to illustrate reachability. If $V_s = \{4, 5\}$, then
Reach$(2, V_s) = \{1, 3, 6\}$ and Reach$(6, V_s) = \{2, 3, 7\}$.
A very important special case of a digraph is one with no cycles. A directed 
acyclic graph is called a DAG. In a DAG, if there is a path $u \Longrightarrow v$ of nonzero
length, then u is called an ancestor of v and v is said to be a descendant of u. 
Figure 2.3 depicts a DAG with two different orderings. For the labelling of the 
vertices on the left, vertices 2, 3, 5, and 6 are descendants of vertex 1, but only 
vertices 5 and 6 are descendants of vertex 4. Note that if the direction of each edge 
in a DAG is reversed, the resulting graph is also a DAG. 

The notion of a reachable set is useful for the study of Gaussian elimination. 
Given a graph and a subset $V_s$ of its vertices, if $u$ and $v$ are two distinct vertices that
do not belong to $V_s$, then $v$ is reachable from $u$ through $V_s$ if $u$ and $v$ are connected
by a path that is either of length 1 or is composed entirely of vertices that belong
to $V_s$ (except for the endpoints $u$ and $v$). Given $V_s$ and $u \notin V_s$, the reachable set
Reach$(u, V_s)$ is the set of all vertices that are reachable from $u$ through $V_s$. Note
that if $V_s$ is empty or $u$ does not belong to $\mathrm{adj}_G\{V_s\}$, then Reach$(u, V_s) = \mathrm{adj}_G\{u\}$.
A simple example is given in Figure 2.4. 
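
The definition of a reachable set translates directly into a short graph search. The following Python sketch is an illustration only: the function name `reach`, the adjacency-list dictionary, and the small example graph are assumptions made here and are not the graph of Figure 2.4.

```python
def reach(adj, u, vs):
    """Return Reach(u, vs): the vertices outside vs that are reachable from u
    by a path whose intermediate vertices all belong to vs."""
    vs = set(vs)
    if u in vs:
        raise ValueError("u must not belong to vs")
    reached = set()
    seen = {u}
    stack = [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in seen:
                continue
            seen.add(y)
            if y in vs:
                stack.append(y)   # the path may continue through a vertex of vs
            else:
                reached.add(y)    # an endpoint outside vs: the path stops here
    return reached


# Hypothetical undirected graph with edges 1-2, 2-3, 3-4, 2-5, 5-6.
adj = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2, 6], 6: [5]}
through_vs = reach(adj, 1, {2, 5})
not_adjacent_to_vs = reach(adj, 4, {2, 5})
empty_vs = reach(adj, 2, set())
```

In this example, vertex 4 is not adjacent to any vertex of the chosen subset, so its reachable set is just its adjacency set, as noted above.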
2.3 Trees, Components, and Connectivity


An undirected graph is connected if every pair of vertices is connected by a path. 
A connected acyclic graph is called a tree, that is, a tree is an undirected graph in 
which any two vertices are connected by exactly one path. Every tree with at least two
vertices has at least two vertices of degree 1. Such vertices are called leaf vertices. A graph is a forest if
it consists of a disjoint union of trees. This is illustrated in Figure 2.5. 

If G is connected, then a spanning tree of G is a subgraph of G that is a tree 
containing every vertex of G. In general, a graph may have several spanning trees, 
but a graph that is not connected does not contain a spanning tree. 

The concept of connectivity can be extended to the general case. A digraph
$G = (V, E)$ is strongly connected if for every pair of vertices $u, v \in V$ there is a path
from $u$ to $v$ and a path from $v$ to $u$.

An equivalence relation on a set is a relation between pairs of members of the set
that satisfies three simple properties: reflexivity, symmetry, and transitivity.
A key property of an equivalence relation on a set is that it induces a partitioning of
the set. Strong connectivity is an equivalence relation on $V$. It induces a partitioning
$V = V_1 \cup \cdots \cup V_s$ such that each $V_i$ ($1 \le i \le s$) is strongly connected and
is maximal with this property: no additional vertices from $G$ can be included in
$V_i$ without breaking its strong connectivity. The $V_i$ are called strongly connected
components (or sometimes just strong components) of $G$.

Any undirected tree $T = (V, E)$ can be converted into a directed rooted tree
$T' = (V, E')$ by specifying a root vertex $r$. Note that $r$ can be chosen arbitrarily:
any choice gives a directed rooted tree. An edge $(u, v) \in E$ becomes a directed edge
$(u \longrightarrow v) \in E'$ if there is a path from $u$ to $r$ such that the first edge of this
path is from $u$ to $v$. Given $r$, this directed path is unique. We illustrate this
transformation in Figure 2.6. The vertex $v$ is called the parent of $u$ if the directed edge
$(u \longrightarrow v) \in E'$; $u$ is said to be a child of $v$ (two or more child vertices are
referred to as children). Two vertices in a rooted tree are siblings if they have the same
parent. Leaf vertices have no children. A rooted tree is a special case of a DAG.
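
The conversion of an undirected tree into a rooted tree can be sketched by a search from the chosen root that records, for each vertex, the neighbour through which it was first reached; that neighbour is its parent. The function name `root_tree` and the small tree below are assumptions made for illustration (the tree is not the one in Figure 2.6).

```python
from collections import deque


def root_tree(adj, r):
    """Return a dict mapping each vertex of the tree to its parent when the
    tree is rooted at r (the root has parent None)."""
    parent = {r: None}
    queue = deque([r])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in parent:      # u is reached for the first time via v
                parent[u] = v        # directed edge (u -> v) of the rooted tree
                queue.append(u)
    return parent


# Hypothetical undirected tree with edges 1-4, 2-4, 4-5, 5-3.
tree = {1: [4], 2: [4], 3: [5], 4: [1, 2, 5], 5: [4, 3]}
parent = root_tree(tree, 4)
# Leaf vertices are those that are not the parent of any vertex.
leaves = sorted(v for v in parent if v not in parent.values())
```

Choosing a different root changes the parent relation but not the underlying undirected tree.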


Figure 2.5 An example of an undirected graph with 12 vertices that is a forest (it consists of two 
disjoint trees). Vertices 1, 2, 3, 6, 7, 8, and 11 are leaf vertices. 
Figure 2.6 An example of an undirected tree $T$ (left) and the rooted tree $T'$ (right) obtained from
$T$ by choosing the root $r = 4$. The arrows indicate the direction of the edges.


2.4 Adjacency Graphs 


Adjacency graphs provide a link between sparse matrices and graphs. If $A$ is a sparse
matrix of order $n$, then an adjacency graph $G(A) = (V(A), E(A))$ (often written
simply as $G$) with $n$ vertices $V(A) = \{1, \ldots, n\}$ can be associated with it. If $A$ is
structurally symmetric, then the edge set is

\[ E(A) = \{(i, j) \mid a_{ij} \neq 0,\ i \neq j\}. \]

A digraph can be associated with a nonsymmetric $A$ by setting

\[ E(A) = \{(i \longrightarrow j) \mid a_{ij} \neq 0,\ i \neq j\}. \]


Each diagonal nonzero $a_{ii}$ corresponds to a loop or self-edge. They are generally
omitted from $G$, and many algorithms that use $G$ implicitly assume that the diagonal
entries of $A$ are present. Figure 2.7 depicts the sparsity patterns of two simple sparse
matrices and their graphs. To capture not only the sparsity pattern of $A$ but also the
values of the entries, $G$ can be transformed into a weighted graph using a mapping
$E(A) \to \mathbb{R}$ and/or $V(A) \to \mathbb{R}$.

A special case is the directed graph associated with a triangular matrix. If $L$ is a
lower triangular matrix and $U$ is an upper triangular matrix, then the directed graphs
$G(L)$ and $G(U)$ have edge sets

\[ E(L) = \{(i \longrightarrow j) \mid l_{ij} \neq 0,\ i > j\} \quad\text{and}\quad E(U) = \{(i \longrightarrow j) \mid u_{ij} \neq 0,\ i < j\}. \tag{2.1} \]

It is sometimes convenient to use $G(L^T)$ in which the direction of the edges is
reversed

\[ E(L^T) = \{(j \longrightarrow i) \mid l_{ij} \neq 0,\ i > j\}. \tag{2.2} \]

Figure 2.7 An example of a structurally symmetric sparse matrix and its undirected graph (left)
and a nonsymmetric sparse matrix and its digraph (right). Arrows indicate the direction of the
edges in the digraph.


It is straightforward to see that $G(L)$, $G(L^T)$, and $G(U)$ are DAGs; they are
sometimes referred to as elimination DAGs.
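
As a minimal sketch of how an adjacency graph can be read off a matrix, the following Python function lists, for each vertex $i$, the vertices $j \neq i$ with $a_{ij} \neq 0$. The function name, the dense list-of-lists storage, and the two small matrices are assumptions chosen for illustration; they are not the matrices of Figure 2.7.

```python
def adjacency_digraph(A):
    """Return the digraph of the square matrix A as a dict that maps each
    vertex i (1-based) to the list of vertices j with a_ij != 0 and j != i.
    Diagonal entries (self-edges) are omitted."""
    n = len(A)
    return {
        i + 1: [j + 1 for j in range(n) if j != i and A[i][j] != 0]
        for i in range(n)
    }


# Hypothetical nonsymmetric matrix: edges 1 -> 2, 2 -> 3, 3 -> 1.
A = [[1, 2, 0],
     [0, 1, 3],
     [4, 0, 1]]
G_A = adjacency_digraph(A)

# Hypothetical lower triangular matrix: every edge (i -> j) has i > j.
L = [[1, 0, 0],
     [2, 1, 0],
     [0, 3, 1]]
G_L = adjacency_digraph(L)
```

For a structurally symmetric matrix the same function returns symmetric adjacency lists, which can be read as an undirected graph.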


2.5 Matrix Permutations and Orderings 


In sparse matrix algorithms, permutations are important transformations. A permutation
matrix $P$ is a square matrix that has exactly one entry equal to unity
in each row and column, and all remaining entries are zeros (that is, it is a
permutation of the identity matrix). Premultiplying a matrix by $P$ reorders the rows
and postmultiplying by $P$ reorders the columns. $P$ can be represented by an
integer-valued permutation vector $p$, where $p_i$ is the column index of the unity within the
$i$-th row of $P$. For example,

\[ P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \quad\text{and}\quad p = \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix}. \]


The graph of a matrix $A$ is unchanged if a symmetric permutation $A' = PAP^T$
is performed; only the labelling (that is, the ordering) of the vertices changes, and
thus relabelling $G(A)$ can be used to permute $A$. This invariance property is key in
sparse matrix algorithms. As an example, consider the arrowhead matrix $A$ and its
graph $G(A)$ given in Figure 2.8. The symmetrically permuted matrix $A'$ and $G(A')$
are also shown, with $P$ chosen such that the first row and column of $A$ are the last
row and column of $A'$.

Figure 2.8 An example of an arrowhead matrix and its undirected graph (left) and a
symmetrically permuted arrowhead matrix and its undirected graph (right).
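
The relation between a permutation vector and a symmetric permutation can be checked numerically. The NumPy sketch below is illustrative only: the helper names and the $5 \times 5$ arrowhead pattern are assumptions, and indices are 0-based, so the vector $p = (2, 3, 1)^T$ of the text becomes `[1, 2, 0]`.

```python
import numpy as np


def permutation_matrix(p):
    """Build P from a 0-based permutation vector p, where p[i] is the column
    index of the unit entry in row i of P."""
    n = len(p)
    P = np.zeros((n, n), dtype=int)
    P[np.arange(n), p] = 1
    return P


def symmetric_permute(A, p):
    """Return P A P^T without forming P: entry (i, j) is A[p[i], p[j]]."""
    return A[np.ix_(p, p)]


# The 3 x 3 example from the text, shifted to 0-based indices.
p = np.array([2, 3, 1]) - 1
P = permutation_matrix(p)

# Hypothetical 5 x 5 arrowhead pattern: full first row and column plus diagonal.
A = np.eye(5, dtype=int)
A[0, :] = 1
A[:, 0] = 1

# Move the first row and column to the last position.
q = np.array([1, 2, 3, 4, 0])
A_perm = symmetric_permute(A, q)
```

The permuted pattern has the same number of entries as the original one; only the vertex labels of the graph have changed.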

The digraph $G$ of a general matrix $A$ is not invariant under nonsymmetric
permutations $PAQ$, with $Q \neq P^T$. A topological ordering of $G$ is a labelling of its
vertices such that for every edge $(i \longrightarrow j)$, vertex $i$ precedes vertex $j$ (i.e., $i < j$). It
can be shown that a topological ordering is possible if and only if $G$ has no directed
cycles, that is, it is a DAG. Any DAG has at least one topological ordering. The
non-uniqueness of topological orderings of a DAG is shown in Figure 2.3.
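
One standard way to compute a topological ordering, not taken from this text, is to repeatedly remove a vertex that has no incoming edges (often attributed to Kahn). The sketch below uses hypothetical names and a small hypothetical DAG; it returns `None` when the digraph contains a directed cycle.

```python
from collections import deque


def topological_order(vertices, edges):
    """Return the vertices in an order such that i precedes j for every
    edge (i -> j), or None if the digraph has a directed cycle."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for i, j in edges:
        succ[i].append(j)
        indeg[j] += 1
    ready = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:    # all predecessors of w have been ordered
                ready.append(w)
    if len(order) != len(succ):
        return None              # some vertices lie on a directed cycle
    return order


# Hypothetical DAG with edges a -> b, a -> c, b -> d, c -> d.
dag_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
order = topological_order(["a", "b", "c", "d"], dag_edges)
cyclic = topological_order([1, 2], [(1, 2), (2, 1)])
```

Processing the ready vertices in a different order generally gives a different, equally valid, topological ordering.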


2.6 Lists, Stacks and Queues 


Sparse matrix algorithms frequently require the storage and manipulation of lists. A 
list is an ordered sequence of arbitrary elements 


\[ (u_0, u_1, \ldots, u_{k-1}, u_k); \tag{2.3} \]

$u_0$ is the head of the list, and $u_k$ is its tail. An empty list is denoted by ().

A stack is a list in which elements can only be added to or removed from the
head. A pointer locates the head of the stack. Let $S = (u_0, u_1, \ldots, u_{k-1}, u_k)$ be
a stack. push$(S, v)$ denotes adding $v$ onto the stack by incrementing the pointer
by one, giving $(v, u_0, \ldots, u_k)$. pop$(S, u_0)$ denotes the stack $(u_1, \ldots, u_k)$ that results
from decreasing the pointer by one (removing $u_0$ from the head). A queue is a list
in which elements can be added to the tail (appended) or removed (popped) from
the head. Consider the queue $Q = (u_0, u_1, \ldots, u_{k-1}, u_k)$. The append operation
append$(Q, u_{k+1})$ results in the queue $(u_0, \ldots, u_k, u_{k+1})$, and the pop operation
pop$(Q, u_0)$ results in the queue $(u_1, \ldots, u_k)$.
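
These operations map naturally onto a double-ended queue. The sketch below uses Python's `collections.deque`, with the head of the list stored at the left end; the helper functions are assumptions made for illustration, and here `pop` returns the removed head rather than taking it as a second argument.

```python
from collections import deque


def push(S, v):
    """Add v at the head of the stack S."""
    S.appendleft(v)


def pop(L):
    """Remove and return the head of a stack or queue."""
    return L.popleft()


def append(Q, v):
    """Add v at the tail of the queue Q."""
    Q.append(v)


S = deque(["u0", "u1", "u2"])
push(S, "v")              # S is now (v, u0, u1, u2)
head_of_stack = pop(S)    # removes v again

Q = deque(["u0", "u1", "u2"])
append(Q, "u3")           # Q is now (u0, u1, u2, u3)
head_of_queue = pop(Q)    # removes u0
```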


2.7 Graph Searches 


Many sparse matrix reordering algorithms involve searching the adjacency graph 
G(A). The sequence in which the vertices are visited can be used, for example, to 
reorder the graph and hence permute the matrix. Given a start vertex, a graph search 
(also called a graph traversal) performs a step-by-step exploration of the vertices 
and edges of $G(A)$, generating sets of visited vertices and explored edges. Let $V_v$ be
the set of visited vertices and $V_n$ be the set of vertices that have not yet been visited.
Following some chosen rule, the search step selects an unexplored edge such that
one of its vertices belongs to $V_v$. If the other vertex belongs to $V_n$, then this vertex
is moved into $V_v$, and the edge is flagged as explored. The explored edge may be
directed or undirected; in an undirected graph, the edge $(u, v)$ formally corresponds
to the pair of edges $(u \longrightarrow v)$ and $(v \longrightarrow u)$.


2.7.1 Breadth-First Search 


Starting from a chosen start vertex s, a breadth-first search (BFS) explores all the 
vertices adjacent to s. It then explores all the vertices whose distance from s is 2, and 
then 3, and so on (that is, sibling vertices are visited before child vertices); a queue 
is used in its implementation. The search terminates when there are no unexplored 
edges $(u, v)$ with $u \in V_v$ and $v \in V_n$ that are reachable from $s$. A simple example
with s = 1 is given in Figure 2.9. All the vertices that are at the same distance from 
s are said to belong to the same level of the graph. At each level, the order in which 
the vertices are visited is not fixed. 
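
A queue-based BFS that also records the level of each vertex can be sketched as follows. The function name and the example graph are assumptions; the edges below are merely one graph that is consistent with the levels described in the caption of Figure 2.9, not necessarily the graph drawn there.

```python
from collections import deque


def bfs_levels(adj, s):
    """Breadth-first search from s. Returns the visiting order and a dict
    giving the level (distance from s) of every visited vertex."""
    level = {s: 0}
    order = [s]
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:            # v has not yet been visited
                level[v] = level[u] + 1
                order.append(v)
                queue.append(v)
    return order, level


# Hypothetical connected undirected graph.
adj = {1: [2, 3, 4, 5], 2: [1, 6], 3: [1, 7], 4: [1],
       5: [1, 8], 6: [2], 7: [3], 8: [5]}
order, level = bfs_levels(adj, 1)
```

Reordering the entries of an adjacency list changes the order of visits within a level but not the levels themselves.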


2.7.2 Depth-First Search 


A depth-first search (DFS) of a graph $G$ visits child vertices before visiting sibling
vertices; that is, it traverses the depth of a path before exploring its breadth. Starting
from a chosen vertex $s$, the vertices that are visited are those vertices $u$
for which a directed path from $s$ to $u$ exists in $G$. This will give different results
depending on $s$ and how ties are broken. In the example given in Figure 2.10, the
search works from left to right. As with the BFS, all vertices in Reach$(s)$ are visited.
The edges that are traversed form a DFS spanning tree. In general, visiting all the
edges of a graph results in a DFS forest that consists of exactly one DFS spanning
tree for each connected component of the original graph. Thus the DFS can be used
to compute connected components (see Algorithm 3.6).

Figure 2.9 An illustration of a BFS of a connected undirected graph, with the labels indicating
the order in which the vertices are visited. Vertices 2, 3, 4, 5 are all at distance 1 from $s$ and so
belong to the first level; vertices 6, 7, 8 belong to the second level.

Figure 2.10 An illustration of a DFS of a connected directed graph. The labels indicate the order
in which the vertices are visited. The edges of the DFS spanning tree are in bold.

There are a number of ways to construct the output vertex order for a DFS. In a
preorder list, the vertices are returned in the order in which they are added into $V_v$,
while in a postorder list, the vertices are in the order in which they are last visited
during the DFS algorithm (note that the reverse of a postordering is not the same as
preordering). For the example in Figure 2.10, the vertices are added into $V_v$ in the
order 1, 2, 3, 4, 5, 6, 7, and this is the preorder list. The sequence in which the DFS
visits the vertices is 1, 2, 3, 2, 4, 2, 1, 5, 6, 5, 1, 7, 1. In this sequence, vertex 3 is the
first vertex to appear for the last time so the postordering starts with vertex 3. The
next vertex to appear for the last time is vertex 4, followed by vertex 2, and so on,
resulting in the postorder list 3, 4, 2, 6, 5, 7, 1.

Algorithm 2.1 presents a DFS and outputs both the preorder and postorder lists. 
The call dfs_step is made exactly once for each vertex v. Observe that if there is a 
path from vertex v to vertex w in the search tree, then v is labelled ahead of w in 
the preorder list and w is labelled ahead of v in the postorder list.




ALGORITHM 2.1 Find preorder and postorder lists using a DFS
Input: Directed graph $G = (V, E)$.
Output: Preorder list preorder and postorder list postorder.

 1: $V_v = \emptyset$, preorder = () and postorder = ()
 2: for all $v \in V$ do
 3:     if $v \notin V_v$ then
 4:         push(preorder, v)           ▷ Add v onto the preorder stack
 5:         $V_v = V_v \cup \{v\}$      ▷ Add v to the set of visited vertices
 6:         dfs_step(v)
 7:     end if
 8: end for
 9: recursive function (dfs_step(v))
10:     for all $(v \longrightarrow w) \in E$ do
11:         if $w \notin V_v$ then
12:             push(preorder, w)       ▷ Add w onto the preorder stack
13:             $V_v = V_v \cup \{w\}$  ▷ Add w to the set of visited vertices
14:             dfs_step(w)             ▷ recursive search
15:         end if
16:     end for
17:     push(postorder, v)              ▷ Add v onto the postorder stack
18: end recursive function
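
A direct Python transcription of Algorithm 2.1 is sketched below. Two assumptions are made: the lists are built by appending at the tail, so that they read in visiting order, and the example digraph contains only the edges of the DFS spanning tree implied by the visiting sequence quoted for Figure 2.10 (any further edges of that figure are not reproduced).

```python
def dfs_orders(vertices, adj):
    """Return the preorder and postorder lists of a DFS of a digraph.
    adj maps a vertex v to the list of vertices w with an edge (v -> w)."""
    visited = set()
    preorder, postorder = [], []

    def dfs_step(v):
        for w in adj.get(v, []):
            if w not in visited:
                preorder.append(w)     # w is added to the visited set now
                visited.add(w)
                dfs_step(w)            # recursive search
        postorder.append(v)            # v is visited for the last time

    for v in vertices:
        if v not in visited:
            preorder.append(v)
            visited.add(v)
            dfs_step(v)
    return preorder, postorder


# Assumed tree edges: 1 -> 2, 2 -> 3, 2 -> 4, 1 -> 5, 5 -> 6, 1 -> 7.
adj = {1: [2, 5, 7], 2: [3, 4], 5: [6]}
preorder, postorder = dfs_orders(range(1, 8), adj)
```

For very deep graphs the recursion would need to be replaced by an explicit stack, because Python limits the recursion depth.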


2.8 Notes and References 


Graph theory has become an important mathematical tool in a wide variety of 
subjects, as well as being a mathematical discipline in its own right. There are 
many introductory textbooks. For example, the first four chapters of Wilson (1996) 
provide a basic foundation course, including definitions and examples of graphs, and 
the graduate-level textbook Bondy & Murty (2008) presents a coherent introduction 
to graph theory. The introductions to graphs given in computer science monographs 
such as Cormen et al. (2009) and Skiena (2020) are also ideal for our purposes. 

Many papers that present sparse matrix algorithms employ graph concepts. 
Significant contributions include Parter (1961), Rose (1973), Rose et al. (1976), and 
Rose & Tarjan (1978). Important ideas first appeared in the published proceedings 
of some of the early conferences that focussed on sparse matrix computations, 
including Reid (1971), Rose & Willoughby (1972), Duff (1981), and Evans (1985). 
Much of the fundamental work from the 1960s and 1970s is given in the book by 
Tewarson (1973) and summarized later by Pissanetzky (1984). The general texts on 
sparse factorizations by George & Liu (1981), Davis (2006), and Duff et al. (2017) 
provide further sources of references and examples; see also Kepner & Gilbert 
(2011). 

Discussions of data structures and graph searches can be found in Aho et al. 
(1983) and Tarjan (1983). The systematic analysis of the depth-first search algorithm
is given in Tarjan (1972), but backtracking techniques on which this search is based
were used even earlier in artificial intelligence and combinatorial optimization.


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 3
Introduction to Matrix Factorizations


If numerical analysts understand anything, surely it must be 
Gaussian elimination. This is the oldest and truest of numerical 
algorithms ... This algorithm has been so successful that to 
many of us, Gaussian elimination and Ax = b are more or less 
synonymous. — Trefethen (1985). 


Gaussian elimination is the standard method for solving a 
system of linear equations. As such, it is one of the most 
ubiquitous numerical algorithms and plays a fundamental role 
in scientific computation. — Higham (2011) 


This chapter introduces the basic concepts of Gaussian elimination and its formula- 
tion as a matrix factorization that can be expressed in a number of mathematically 
equivalent but algorithmically different ways. 

Using unweighted graphs to capture the sparsity structures of matrices during
Gaussian elimination is simplified by assuming that the result of adding, subtracting,
or multiplying two nonzeros is nonzero. It follows that if $A = LU$ and $E_L$ denotes
the set of (directed) edges of the digraph $G(L)$, then for $i > j$

\[ a_{ij} \neq 0 \ \text{ implies } \ (i \longrightarrow j) \in E_L. \]


This is the non-cancellation assumption. It allows the following observation. 


Observation 3.1 The sparsity structures of the LU factors of $A$ satisfy
$S\{A\} \subseteq S\{L + U\}$.


That is, the factors may contain entries that lie outside the sparsity structure of A. 
Such entries are termed filled entries, and together the filled entries are called the 
fill-in. The graph obtained from G(A) by adding the fill-in is called the filled graph. 


Numerical cancellations in LU factorizations rarely happen, and in general,
they are difficult to predict, particularly in floating-point arithmetic. Thus, such
accidental zeros are not normally exploited in implementations, and we will ignore
the possibility of their occurrence.


3.1 Gaussian Elimination: An Overview 


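The traditional way of describing Gaussian elimination is based on the systematic
column-by-column annihilation of the entries in the lower triangular part of $A$.
Assuming $A$ is factorizable, this can be written formally as sequential multiplications
by column elimination matrices that yield the elimination sequence

\[ A = A^{(1)}, A^{(2)}, \ldots, A^{(n)} \tag{3.1} \]

of partially eliminated matrices as follows:

\[ A^{(1)} \longrightarrow A^{(2)} = C_1 A^{(1)} \longrightarrow A^{(3)} = C_2 C_1 A^{(1)} \longrightarrow \cdots \longrightarrow A^{(n)} = C_{n-1} \cdots C_2 C_1 A^{(1)}. \]

The unit lower triangular matrices $C_i$ ($1 \le i \le n-1$) are the column elimination
matrices. Elementwise, assuming $a_{11} = a_{11}^{(1)} \neq 0$, the first step $C_1 A^{(1)} = A^{(2)}$ is

\[
\begin{pmatrix}
1 & & & \\
-a_{21}^{(1)}/a_{11}^{(1)} & 1 & & \\
\vdots & & \ddots & \\
-a_{n1}^{(1)}/a_{11}^{(1)} & & & 1
\end{pmatrix}
\begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
a_{21}^{(1)} & a_{22}^{(1)} & \cdots & a_{2n}^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1}^{(1)} & a_{n2}^{(1)} & \cdots & a_{nn}^{(1)}
\end{pmatrix}
=
\begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
0 & a_{n2}^{(2)} & \cdots & a_{nn}^{(2)}
\end{pmatrix},
\]

and provided $a_{22}^{(2)} \neq 0$, the second step $C_2 A^{(2)} = A^{(3)}$ is

\[
\begin{pmatrix}
1 & & & & \\
 & 1 & & & \\
 & -a_{32}^{(2)}/a_{22}^{(2)} & 1 & & \\
 & \vdots & & \ddots & \\
 & -a_{n2}^{(2)}/a_{22}^{(2)} & & & 1
\end{pmatrix}
\begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\
0 & a_{32}^{(2)} & a_{33}^{(2)} & \cdots & a_{3n}^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & a_{n2}^{(2)} & a_{n3}^{(2)} & \cdots & a_{nn}^{(2)}
\end{pmatrix}
=
\begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\
0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & a_{n3}^{(3)} & \cdots & a_{nn}^{(3)}
\end{pmatrix}.
\]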


The $k$-th partially eliminated matrix is $A^{(k)}$. The active entries in $A^{(k)}$ are denoted
by $a_{ij}^{(k)}$, $1 \le k \le i, j \le n$ (in the sparse case, many of the entries are zero), and
the $(n - k + 1) \times (n - k + 1)$ submatrix of $A^{(k)}$ containing the active entries is
termed its active submatrix. The graph associated with the active submatrix is the
$k$-th elimination graph and is denoted by $G^k$. If $S\{A\}$ is nonsymmetric, then $G^k$ is
a digraph.

The inverse of each $C_k$ is the unit lower triangular matrix that is obtained by
changing the sign of all the off-diagonal entries, and because the product of unit
lower triangular matrices is a unit lower triangular matrix, it is clear that provided
$a_{kk}^{(k)} \neq 0$ ($1 \le k < n$)

\[ A = A^{(1)} = C_1^{-1} C_2^{-1} \cdots C_{n-1}^{-1} A^{(n)} = LU, \]

where the unit lower triangular matrix $L$ is the product $C_1^{-1} C_2^{-1} \cdots C_{n-1}^{-1}$ and
$U = A^{(n)}$ is an upper triangular matrix. The subdiagonal entries of $L$ are the negative of
the subdiagonal entries of the matrix $C_1 + C_2 + \cdots + C_{n-1}$. If $A$ is a symmetric
positive definite (SPD) matrix, then setting $U = DL^T$, the LU factorization can be
written as

\[ A = LDL^T, \]

which is the square root-free Cholesky factorization. Alternatively, it can be
expressed as the Cholesky factorization

\[ A = (LD^{1/2})(LD^{1/2})^T, \]

where the lower triangular matrix $LD^{1/2}$ has positive diagonal entries.
The process of performing an LU factorization can be rewritten in the generic
form given in Algorithm 3.1. Here each $l_{ik}$ is called a multiplier, and the $a_{kk}^{(k)}$ are
called pivots. The assumption that $A$ is factorizable implies $a_{kk}^{(k)} \neq 0$ for all $k$.
Algorithm 3.1 comprises three nested loops. There are six ways of assigning the
indices to the loops, with the loops having different ranges. The performance of the
variants can differ significantly depending on the computer architecture. The key
difference is the way the data are accessed from the factorized part of the matrix and
applied to the part that is not yet factorized. But in exact arithmetic, they result in
the same $L$ and $U$, which allows any of them to be used to demonstrate theoretical
properties of LU factorizations. To identify the variants, names that derive from
the order in which the indices are assigned to the loops can be used. The $kij$ and
$kji$ variants are called submatrix LU factorizations. The schemes $jik$ and $jki$
compute the factors by columns and are called column factorizations. The final
two are row factorizations because they proceed by rows. A row factorization can
be considered as a column LU factorization applied to $A^T$.

ALGORITHM 3.1 Generic LU factorization
Input: Factorizable matrix $A$.
Output: LU factorization $A = LU$.

1: for ______ do
2:     for ______ do
3:         for ______ do
4:             $l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}$
5:             $a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik} a_{kj}^{(k)}$
6:         end for
7:     end for
8: end for
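
To make the role of the loop order concrete, the following NumPy sketch fills in the loops of Algorithm 3.1 in two of the six possible ways: a $kij$ (submatrix) variant and a $jki$ (column) variant. It is a dense, unpivoted illustration under the assumption that $A$ is factorizable; the function names and the test matrix are hypothetical, and the multiplier is computed once per row rather than inside the innermost loop.

```python
import numpy as np


def lu_kij(A):
    """Submatrix (kij) LU factorization without pivoting."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / A[k, k]          # multiplier l_ik
            for j in range(k, n):
                A[i, j] -= L[i, k] * A[k, j]     # update of the active submatrix
    return L, np.triu(A)


def lu_jki(A):
    """Column (jki) LU factorization without pivoting."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    for j in range(n):
        for k in range(j):
            # apply the updates from the already computed column k to column j
            for i in range(k + 1, n):
                A[i, j] -= L[i, k] * A[k, j]
        for i in range(j + 1, n):
            L[i, j] = A[i, j] / A[j, j]          # multipliers of column j
    return L, np.triu(A)


# Hypothetical factorizable matrix (all pivots are nonzero).
A_ex = np.array([[4.0, 3.0, 2.0],
                 [2.0, 1.0, 3.0],
                 [3.0, 4.0, 1.0]])
L1, U1 = lu_kij(A_ex)
L2, U2 = lu_jki(A_ex)
```

Both functions perform the same arithmetic operations on each entry and differ only in when the updates are applied, which is the distinction the variant names record.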


3.1.1 Submatrix LU Factorizations 


Each outermost step of the submatrix LU variants computes one row of $U$ and one
column of $L$. The first step ($k = 1$) is

\[
C_1 A =
\begin{pmatrix} 1 & \\ -A_{2:n,1}/a_{11} & I \end{pmatrix}
\begin{pmatrix} a_{11} & A_{1,2:n} \\ A_{2:n,1} & A_{2:n,2:n} \end{pmatrix}
=
\begin{pmatrix} a_{11} & A_{1,2:n} \\ 0 & S \end{pmatrix},
\]

where the $(n - 1) \times (n - 1)$ active submatrix

\[ S = A_{2:n,2:n} - A_{2:n,1} A_{1,2:n} / a_{11} = A_{2:n,2:n} - L_{2:n,1} U_{1,2:n} \]

is the Schur complement of $A$ with respect to $a_{11}$. If $A$ is factorizable, then so too
is $S$ and the process can be repeated.

More generally, the operations performed at each step $k$ correspond to a sequence
of rank-one updates. The resulting Schur complement can be written in terms of
entries of the matrices from the elimination sequence and entries of the computed
factors. After $k - 1$ steps ($1 < k \le n$), the $(n - k + 1) \times (n - k + 1)$ Schur complement
of $A$ with respect to its $(k - 1) \times (k - 1)$ principal leading submatrix is the active
submatrix of the partially eliminated matrix $A^{(k)}$ given by

\[
S^{(k)} =
\begin{pmatrix} a_{kk} & \cdots & a_{kn} \\ \vdots & \ddots & \vdots \\ a_{nk} & \cdots & a_{nn} \end{pmatrix}
- \sum_{j=1}^{k-1}
\begin{pmatrix} l_{kj} \\ \vdots \\ l_{nj} \end{pmatrix}
\begin{pmatrix} u_{jk} & \cdots & u_{jn} \end{pmatrix}
= A^{(k)}_{k:n,k:n}. \tag{3.2}
\]

If $A$ is SPD, then the Cholesky and $LDL^T$ factorizations that are special cases of
the submatrix approach are termed right-looking (fan-out) factorizations.


3.1.2 Column LU Factorizations 


In the column LU factorization, the outermost index in Algorithm 3.1 is $j$. For $j = 1$,
$l_{11} = 1$, and the off-diagonal entries in column 1 of $L$ are obtained by dividing
the corresponding entries in column 1 of $A$ by $u_{11} = a_{11}$. Assume $j - 1$ columns
($1 < j \le n$) of $L$ and $U$ have been computed. The partial column factorization can
be expressed as

\[
\begin{pmatrix} L_{1:j-1,1:j-1} \\ L_{j:n,1:j-1} \end{pmatrix} U_{1:j-1,1:j-1}
=
\begin{pmatrix} A_{1:j-1,1:j-1} \\ A_{j:n,1:j-1} \end{pmatrix}.
\]

Column $j$ of $U$ and then column $j$ of $L$ are computed using the identities

\[ U_{1:j-1,j} = L_{1:j-1,1:j-1}^{-1} A_{1:j-1,j}, \qquad u_{jj} = a_{jj} - L_{j,1:j-1} U_{1:j-1,j}, \]

and

\[ l_{jj} = 1, \qquad L_{j+1:n,j} = (A_{j+1:n,j} - L_{j+1:n,1:j-1} U_{1:j-1,j}) / u_{jj}. \]

Thus the strictly upper triangular part of column $j$ of $U$ is determined by solving
the triangular system

\[ L_{1:j-1,1:j-1} U_{1:j-1,j} = A_{1:j-1,j}, \]

and the strictly lower triangular part of column $j$ of $L$ is computed as a linear
combination of column $A_{j+1:n,j}$ of $A$ and previously computed columns of $L$.

If $A$ is symmetric and the pivots can be used in the order $1, 2, \ldots$ without
modification, then there is the following link between its column LU and $LDL^T$
factorizations.

Observation 3.2 The $j$-th diagonal entry $d_{jj}$ ($1 \le j \le n$) of the $LDL^T$
factorization of the symmetric matrix $A$ is

\[ d_{jj} = u_{jj} = a_{jj} - \sum_{k=1}^{j-1} d_{kk} l_{jk}^2. \]

The $L$ factor is the same as is computed by the column LU factorization; its
computation can be written as

\[ d_{jj} L_{j+1:n,j} = A_{j+1:n,j} - \sum_{k=1}^{j-1} L_{j+1:n,k}\, d_{kk}\, l_{jk}. \]

The $U$ factor is equal to $DL^T$. Computing $L$ and $D$ in this way is called the
left-looking (fan-in) factorization.

ALGORITHM 3.2 Basic column LU factorization with partial pivoting
Input: Nonsingular nonsymmetric matrix $A$.
Output: LU factorization $PA = LU$, where $P$ is a row permutation matrix.

1: Interchange rows of $A$ so that $|a_{11}| = \max\{|a_{i1}| \mid 1 \le i \le n\}$
2: $l_{11} = 1$, $u_{11} = a_{11}$, $L_{2:n,1} = A_{2:n,1}/a_{11}$
3: for $j = 2 : n$ do
4:     Solve $L_{1:j-1,1:j-1} U_{1:j-1,j} = A_{1:j-1,j}$
5:     $z_{1:n-j+1} = A_{j:n,j} - L_{j:n,1:j-1} U_{1:j-1,j}$
6:     Apply row interchanges to $z$, $A$ and $L$ so that
       $|z_1| = \max\{|z_i| \mid 1 \le i \le n - j + 1\}$
7:     $l_{jj} = 1$, $u_{jj} = z_1$ and $L_{j+1:n,j} = z_{2:n-j+1}/z_1$
8: end for


So far, we have assumed that $A$ is factorizable. If $A$ is nonsingular, then there
exists a row permutation matrix $P$ such that $PA$ is factorizable (Theorem 1.1), and
if there are zeros on the diagonal, then the rows can always be permuted to achieve
a nonzero diagonal. Consider the simple $2 \times 2$ matrix $A$ and its LU factorization

\[
A = \begin{pmatrix} \delta & 1 \\ 1 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ \delta^{-1} & 1 \end{pmatrix}
\begin{pmatrix} \delta & 1 \\ 0 & 1 - \delta^{-1} \end{pmatrix}.
\]

If $\delta = 0$, this factorization does not exist, and if $\delta$ is very small, then the entries in
the factors involving $\delta^{-1}$ are very large. But interchanging the rows of $A$, we have

\[
PA = \begin{pmatrix} 1 & 1 \\ \delta & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ \delta & 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ 0 & 1 - \delta \end{pmatrix},
\]

which is valid for all $\delta \neq 1$. Algorithm 3.2 presents a basic column LU factorization
scheme for nonsingular $A$. The interchanging of rows at each elimination step to
select the entry of largest absolute value in its column as the next pivot is called
partial pivoting. It avoids small pivots and results in an LU factorization of a row
permuted matrix $PA$ in which the absolute value of each entry of $L$ is at most 1. In
practice, partial pivoting (or another pivoting strategy) is incorporated into all LU
factorization variants. Pivoting strategies are discussed in Chapter 7.
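
The steps of Algorithm 3.2 can be sketched in NumPy as follows. This is a dense illustration only, assuming a nonsingular input; the function name is hypothetical, indices are 0-based, and the row permutation is returned as a vector `perm` such that rows `perm` of the original matrix equal $LU$.

```python
import numpy as np


def column_lu_partial_pivoting(A):
    """Column LU factorization with partial pivoting (dense sketch).
    Returns (perm, L, U) with A[perm] = L @ U."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    U = np.zeros((n, n))
    perm = np.arange(n)
    for j in range(n):
        # Solve L[:j, :j] U[:j, j] = A[:j, j] by forward substitution
        # (L has a unit diagonal).
        for i in range(j):
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        # Candidate pivot and subdiagonal entries of column j.
        z = A[j:, j] - L[j:, :j] @ U[:j, j]
        r = j + int(np.argmax(np.abs(z)))
        if r != j:
            # Row interchanges applied to A, the computed part of L, and z.
            A[[j, r], :] = A[[r, j], :]
            L[[j, r], :j] = L[[r, j], :j]
            perm[[j, r]] = perm[[r, j]]
            z[[0, r - j]] = z[[r - j, 0]]
        U[j, j] = z[0]
        L[j, j] = 1.0
        L[j + 1:, j] = z[1:] / z[0]
    return perm, L, U


# The 2 x 2 example above with a small value of delta.
A_delta = np.array([[1e-10, 1.0],
                    [1.0, 1.0]])
perm, L, U = column_lu_partial_pivoting(A_delta)
```

For the $2 \times 2$ example the two rows are interchanged, and the only subdiagonal entry of $L$ is $\delta$ rather than $\delta^{-1}$.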
3.1.3 Factorizations by Bordering


The generic LU factorization scheme does not cover all possible approaches. An
alternative is factorization by bordering. Set all diagonal entries of $L$ to 1, and
assume the first $k - 1$ rows of $L$ and first $k - 1$ columns of $U$ ($1 < k \le n$) have been
computed (that is, $L_{1:k-1,1:k-1}$ and $U_{1:k-1,1:k-1}$). At step $k$, the factors must satisfy

\[
\begin{pmatrix} A_{1:k-1,1:k-1} & A_{1:k-1,k} \\ A_{k,1:k-1} & a_{kk} \end{pmatrix}
=
\begin{pmatrix} L_{1:k-1,1:k-1} & 0 \\ L_{k,1:k-1} & 1 \end{pmatrix}
\begin{pmatrix} U_{1:k-1,1:k-1} & U_{1:k-1,k} \\ 0 & u_{kk} \end{pmatrix}.
\]

Equating terms, the lower triangular part of row $k$ of $L$ and the upper triangular part
of column $k$ of $U$ are obtained by solving

\[ L_{k,1:k-1} U_{1:k-1,1:k-1} = A_{k,1:k-1}, \]

\[ L_{1:k-1,1:k-1} U_{1:k-1,k} = A_{1:k-1,k}. \]

The diagonal entry $u_{kk}$ is then given by

\[ u_{kk} = a_{kk} - L_{k,1:k-1} U_{1:k-1,k} \quad (\text{with } u_{11} = a_{11}). \]
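
The bordering scheme can be sketched in NumPy as below. The function name and the test matrix are assumptions, and for brevity the two triangular systems are solved with the general dense solver `np.linalg.solve` rather than a dedicated triangular solver; a factorizable $A$ is assumed.

```python
import numpy as np


def lu_by_bordering(A):
    """LU factorization by bordering (dense sketch, no pivoting)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    U[0, 0] = A[0, 0]
    for k in range(1, n):
        # Row k of L:  L[k, :k] U[:k, :k] = A[k, :k]  (transposed to a column system).
        L[k, :k] = np.linalg.solve(U[:k, :k].T, A[k, :k])
        # Column k of U:  L[:k, :k] U[:k, k] = A[:k, k].
        U[:k, k] = np.linalg.solve(L[:k, :k], A[:k, k])
        # Diagonal entry u_kk.
        U[k, k] = A[k, k] - L[k, :k] @ U[:k, k]
    return L, U


# Hypothetical diagonally dominant (hence factorizable) matrix.
A_b = np.array([[4.0, 1.0, 2.0],
                [1.0, 5.0, 1.0],
                [2.0, 1.0, 6.0]])
L_b, U_b = lu_by_bordering(A_b)
```

At step $k$ only the leading $k \times k$ block of $A$ is accessed, which is what distinguishes bordering from the variants of Algorithm 3.1.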


3.2 Fill-in in Sparse Gaussian Elimination 


Here we give some simple results that describe fill-in in the matrix factors; strategies
to limit fill-in will be presented in Chapter 8. We start by looking at the rules that
establish the positions of the entries in the factors. Assume $S\{A\}$ is symmetric,
and consider the elimination graph $G^k$ at step $k$. Its vertices are the $n - k + 1$
uneliminated vertices. Its edge set contains the edges in $G(A)$ connecting these
vertices and additional edges corresponding to filled entries produced during the
first $k - 1$ elimination steps. The sequence of graphs $G^1 = G(A), G^2, \ldots$ is generated
recursively using Parter’s rule:

To obtain the elimination graph $G^{k+1}$ from $G^k$, delete vertex $k$ and add all
possible edges between vertices that are adjacent to vertex $k$ in $G^k$.

Denoting $G^k = (V^k, E^k)$ and $G^{k+1} = (V^{k+1}, E^{k+1})$, this can be written as

\[ V^{k+1} = V^k \setminus \{k\}, \qquad E^{k+1} = E^k \cup \{(i, j) \mid i, j \in \mathrm{adj}_{G^k}\{k\}\} \setminus \{(i, k) \mid i \in \mathrm{adj}_{G^k}\{k\}\}. \]


If $S\{A\}$ is nonsymmetric, then the elimination graphs are digraphs and Parter’s rule
generalizes as follows:

To obtain the elimination graph $G^{k+1}$ from $G^k$, delete vertex $k$ and add all edges
$(i \overset{G^{k+1}}{\longrightarrow} j)$ such that $(i \overset{G^k}{\longrightarrow} k)$ and $(k \overset{G^k}{\longrightarrow} j)$.

Figure 3.1 Illustration of Parter’s rule. The original undirected graph $G = G^1$ and the elimination
graph $G^2$ that results from eliminating vertex 1 are shown on the left and right, respectively. The
red dashed lines denote fill edges. The vertices {2, 3, 4} become a clique.

Figure 3.2 Illustration of Parter’s rule for a nonsymmetric $S\{A\}$. The original digraph $G = G^1$
and the directed elimination graph $G^2$ that results from eliminating vertex 1 are shown on the left
and right, respectively. The red dashed lines denote fill edges.


Simple examples are given in Figures 3.1 and 3.2. 

In terms of graph theory, if S{A} is symmetric, then Parter’s rule says that the 
adjacency set of vertex k becomes a clique when k is eliminated. Thus, Gaussian 
elimination systematically generates cliques. As the elimination process progresses, 
cliques grow or more than one clique join to form larger cliques, a process known 
as clique amalgamation. A clique with m vertices has m(m — 1)/2 edges, but it can 
be represented by storing a list of its vertices, without any reference to edges. This 
enables important savings in both storage and data movement to be achieved during 
the symbolic phase of a direct solver. 
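
Parter’s rule for symmetric $S\{A\}$ can be simulated directly on adjacency sets. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the elimination order is passed explicitly (which is equivalent to relabelling the vertices), and the example is a star graph, that is, the graph of a $5 \times 5$ arrowhead pattern.

```python
def fill_edges(adj, order):
    """Apply Parter's rule to an undirected graph and return the set of fill
    edges (i, j) with i < j created when the vertices are eliminated in the
    given order. adj maps each vertex to the set of its neighbours."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    fill = set()
    for k in order:
        nbrs = g.pop(k)                 # delete vertex k ...
        for i in nbrs:
            g[i].discard(k)             # ... together with its incident edges
        for i in nbrs:                  # make the neighbours of k a clique
            for j in nbrs:
                if i < j and j not in g[i]:
                    g[i].add(j)
                    g[j].add(i)
                    fill.add((i, j))
    return fill


# Star graph: vertex 1 is adjacent to vertices 2, 3, 4 and 5.
star = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}
fill_centre_first = fill_edges(star, [1, 2, 3, 4, 5])
fill_centre_last = fill_edges(star, [2, 3, 4, 5, 1])
```

Eliminating the centre of the star first turns its four neighbours into a clique, creating six fill edges, whereas eliminating it last creates none; this shows how strongly the amount of fill-in can depend on the elimination order.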

The repeated application of Parter’s rule specifies all the edges in $G(L + L^T)$:

$(i, j)$ is an edge of $G(L + L^T)$ if and only if $(i, j)$ is an edge of $G(A)$ or $(i, k)$ and
$(k, j)$ are edges of $G(L + L^T)$ for some $k < i, j$.

This generalizes to a nonsymmetric matrix $A$ and its LU factorization:

$(i \longrightarrow j)$ is an edge of the digraph $G(L + U)$ if and only if $(i \longrightarrow j)$ is an edge of the
digraph $G(A)$ or $(i \longrightarrow k)$ and $(k \longrightarrow j)$ are edges of $G(L + U)$ for some $k < i, j$.

Figure 3.3 Example to illustrate fill-in during the factorization of a symmetric matrix, with the
eliminations performed in the natural order. $S\{A\}$ and $S\{L + L^T\}$ are on the left and right,
respectively, with the corresponding undirected graphs $G(A)$ and $G(L + L^T)$. Filled entries in
$L + L^T$ are denoted by f. The red dashed lines in the filled graph $G(L + L^T)$ correspond to filled
entries.


Parter’s rule is a local rule that uses the dependency on nonzeros obtained 
in previous steps of the factorization. The following result, which uses the path 
notation of Section 2.2, fully characterizes the nonzero entries in the factors using 
only paths in G(A). 


Theorem 3.1 (Rose et al. 1976; Rose & Tarjan 1978) 


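(a) Let S{A} be symmetric and $A = LL^T$. Then $(L + L^T)_{ij} \neq 0$ if and only if
there is a fill-path $i \overset{G(A)}{\underset{\min}{\Longleftrightarrow}} j$.


(b) Let S{A} be nonsymmetric and $A = LU$. Then $(L + U)_{ij} \neq 0$ if and only if
there is a fill-path $i \overset{G(A)}{\underset{\min}{\Longrightarrow}} j$.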


The fill-paths may not be unique. 


Figure 3.3 illustrates Theorem 3.1 for symmetric S{A}. There is a filled entry in
position (8, 6) of $L$ because there is a fill-path $8 \underset{\min}{\Longleftrightarrow} 6$ given by the sequence
of (undirected) edges $8 \leftrightarrow 2 \leftrightarrow 5 \leftrightarrow 1 \leftrightarrow 6$.
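The fill-path test of Theorem 3.1(a) can be sketched as a breadth-first search that is only allowed to pass through vertices numbered below min(i, j). The helper names and the small graph below are hypothetical; the graph merely contains the path 8, 2, 5, 1, 6 quoted above plus one extra edge, and it is not the graph of Figure 3.3.

```python
from collections import deque


def build_adjacency(edges):
    """Build a dictionary of neighbour sets from undirected edges."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def has_fill_path(adj, i, j):
    """Return True if i and j are joined by a path in G(A) whose
    intermediate vertices are all numbered below min(i, j)."""
    limit = min(i, j)
    seen = {i}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w == j:
                return True          # a direct edge is a trivial fill-path
            if w < limit and w not in seen:
                seen.add(w)          # admissible intermediate vertex
                queue.append(w)
    return False


# Hypothetical graph containing the path 8 - 2 - 5 - 1 - 6 and the edge 6 - 7.
adj = build_adjacency([(8, 2), (2, 5), (5, 1), (1, 6), (6, 7)])
```

With this graph, `has_fill_path(adj, 8, 6)` succeeds because the intermediate vertices 2, 5, and 1 are all smaller than 6, whereas `has_fill_path(adj, 5, 7)` fails because every path from 5 to 7 must pass through vertex 6.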


Corollary 3.2 characterizes edges of $G^k$ in terms of reachable sets in the original
graph $G(A)$.


Figure 3.4 An example to illustrate reachable sets in $G(A)$. The grey vertices 1, 2, and 3 are
eliminated in the first three elimination steps ($V^4 = \{1, 2, 3\}$).


Corollary 3.2 (Rose et al., 1976; George & Liu, 1980b) 

Assume S{A} is symmetric. Let $V^k$ be the set of $k - 1$ vertices of $G(A)$ that have
already been eliminated, and let $v$ be a vertex in the elimination graph $G^k$. Then the
set of vertices adjacent to $v$ in $G^k$ is the set $\mathrm{Reach}(v, V^k)$ of vertices reachable from
$v$ through $V^k$ in $G(A)$.


Proof The proof is by induction on $k$. The result holds trivially for $k = 1$ because
$\mathrm{Reach}(v, V^1) = \mathrm{adj}_{G(A)}\{v\}$. Assume the result holds for $G^1, \ldots, G^k$ with $k \ge 1$,
and let $v$ be a vertex in the graph $G^{k+1}$ that is obtained after eliminating $v_k$ from $G^k$.
If $v$ is not adjacent to $v_k$ in $G^k$, then $\mathrm{Reach}(v, V^{k+1}) = \mathrm{Reach}(v, V^k)$. Otherwise,
if $v$ is adjacent to $v_k$ in $G^k$, then $\mathrm{adj}_{G^{k+1}}\{v\} = \mathrm{Reach}(v, V^k) \cup \mathrm{Reach}(v_k, V^k)$
(excluding $v$ and $v_k$ themselves). In
both cases, Parter’s rule implies that the new adjacency set is exactly equal to the
vertices that are reachable from $v$ through $V^{k+1}$, that is, $\mathrm{Reach}(v, V^{k+1})$. □


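Figure 3.4 depicts a graph $G(A)$. The adjacency sets of the vertices in $G^4$ that
result from eliminating vertices $V^4 = \{1, 2, 3\}$ are $\mathrm{adj}_{G^4}\{4\} = \mathrm{Reach}(4, V^4) =
\{5\}$, $\mathrm{adj}_{G^4}\{5\} = \mathrm{Reach}(5, V^4) = \{4, 6, 7\}$, $\mathrm{adj}_{G^4}\{6\} = \mathrm{Reach}(6, V^4) = \{5, 7\}$,
$\mathrm{adj}_{G^4}\{7\} = \mathrm{Reach}(7, V^4) = \{5, 6, 8\}$, and $\mathrm{adj}_{G^4}\{8\} = \mathrm{Reach}(8, V^4) = \{7\}$.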
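Reachable sets can be computed directly in $G(A)$ without forming any elimination graph. The sketch below is one possible way to do this; the function name `reach` and the 7-vertex graph are hypothetical (the graph is not the one in Figure 3.4), and the search may only pass through vertices that have already been eliminated.

```python
def reach(adj, v, eliminated):
    """Vertices outside `eliminated` that are reachable from v by a path
    whose intermediate vertices all lie in `eliminated`."""
    result = set()
    seen = {v}
    stack = [v]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in seen:
                continue
            seen.add(w)
            if w in eliminated:
                stack.append(w)   # the path may continue through w
            else:
                result.add(w)     # uneliminated endpoint: stop here
    return result


# Hypothetical graph; vertices 1, 2 and 3 are assumed to be eliminated.
adj = {
    1: {4, 5}, 2: {5, 6}, 3: {5, 7},
    4: {1}, 5: {1, 2, 3}, 6: {2, 7}, 7: {3, 6},
}
V4 = {1, 2, 3}
reach_sets = {v: reach(adj, v, V4) for v in (4, 5, 6, 7)}
```

For this graph, the sets returned agree with the adjacency sets obtained by eliminating vertices 1, 2, and 3 in turn using Parter's rule.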

We remark that neither the local characterization of filled entries using Parter’s 
rule nor Theorem 3.1 provides a direct answer as to whether a certain edge belongs 
to $G(L + L^T)$ (or $G(L + U)$); without performing the eliminations, they do not tell us
whether a given entry of a factor of A is nonzero. Such questions are addressed by 
deeper theoretical and algorithmic results that are presented in subsequent chapters. 


3.3 Triangular Solves 


Once an LU factorization has been computed, the solution x of the linear system 
Ax = b is computed by solving the lower triangular system 


$Ly = b$, (3.3)


followed by the upper triangular system


$Ux = y$. (3.4)


Solving a system with a triangular matrix and dense right-hand side vector is 
straightforward. The solution of (3.3) can be computed using forward substitution 
in which the component $y_1$ is determined from the first equation and substituted into
the second equation to obtain $y_2$, and so on. Once $y$ is available, the solution of (3.4)
can be obtained by back substitution in which the last equation is used to obtain $x_n$,
which is then substituted into equation $n - 1$ to obtain $x_{n-1}$, and so on. Algorithm 3.3
is a simple lower triangular solve for dense $b$. If $L$ is unit lower triangular, step 3 is
not needed.


ALGORITHM 3.3 Forward substitution: lower triangular solve Ly = b with b 
dense 

Input: Lower triangular matrix L with nonzero diagonal entries and dense right- 
hand side b. 

Output: The dense solution vector y. 


1: Initialise $y = b$
2: for $j = 1 : n$ do
3:     $y_j = y_j / l_{jj}$
4:     for $i = j + 1 : n$ do
5:         if $l_{ij} \neq 0$ then
6:             $y_i = y_i - l_{ij} y_j$
7:         end if
8:     end for
9: end for


When $b$ is sparse, the solution $y$ is also sparse. In particular, if in Algorithm 3.3
$y_k = 0$, then the outer loop with $j = k$ can be skipped. Furthermore, if $b_1 = b_2 =
\ldots = b_k = 0$ and $b_{k+1} \neq 0$, then $y_1 = y_2 = \ldots = y_k = 0$. Scanning $y$ to check
for zeros adds $O(n)$ to the complexity. But if the set of indices $\mathcal{J} = \{j \mid y_j \neq 0\}$ is
known beforehand, then Algorithm 3.3 can be replaced by Algorithm 3.4. A possible
way to determine $\mathcal{J}$ is discussed later (Theorem 5.2).

Note that the combined effect of forward substitution (3.3) followed by back 
substitution (3.4) often results in the final solution vector x being dense. This is the 
case if $y_n \neq 0$ and $U$ has an off-diagonal entry in each row $i$ ($1 \le i < n$).
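Algorithms 3.3 and 3.4 differ only in which columns the outer loop visits, so both can be sketched with a single Python function. The column-wise list-of-pairs storage, the 0-based indexing, and the name `forward_substitution` are our own assumptions; in particular, the index set is taken as given and is not computed here.

```python
def forward_substitution(L_cols, b, index_set=None):
    """Solve L y = b for a lower triangular L stored by columns.

    L_cols[j] is a list of (i, l_ij) pairs with i >= j, diagonal included.
    If index_set is given, it must contain every j with y_j != 0; only
    those columns are visited, in increasing order (sparse right-hand side).
    """
    y = list(b)
    cols = range(len(b)) if index_set is None else sorted(index_set)
    for j in cols:
        diag = dict(L_cols[j])[j]
        y[j] = y[j] / diag
        for i, lij in L_cols[j]:
            if i > j:                 # only stored nonzeros are touched
                y[i] -= lij * y[j]
    return y


# Hypothetical 3 x 3 example: L = [[2, 0, 0], [1, 1, 0], [0, 3, 4]].
L_cols = [[(0, 2.0), (1, 1.0)], [(1, 1.0), (2, 3.0)], [(2, 4.0)]]
b = [0.0, 2.0, 2.0]
y_dense = forward_substitution(L_cols, b)
y_sparse = forward_substitution(L_cols, b, index_set={1, 2})
```

Because the first entry of `b` is zero, column 0 never contributes, and the sparse variant returns the same vector while skipping it.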


ALGORITHM 3.4 Forward substitution: lower triangular solve $Ly = b$ with $b$
sparse

Input: Lower triangular matrix $L$ with nonzero diagonal entries, sparse vector $b$ and
the set $\mathcal{J}$ of indices $j$ such that $y_j \neq 0$.

Output: The sparse solution vector $y$.


1: Initialise $y = b$
2: for $j \in \mathcal{J}$ do    ▷ Take indices from $\mathcal{J}$ in increasing order
3:     $y_j = y_j / l_{jj}$
4:     for $i = j + 1 : n$ do
5:         if $l_{ij} \neq 0$ then
6:             $y_i = y_i - l_{ij} y_j$
7:         end if
8:     end for
9: end for


3.4 Reducibility and Block Triangular Forms


The performance of algorithms for computing factorizations of sparse matrices can
frequently be significantly enhanced by first permuting $A$ to have a block form or by
partitioning $A$ into blocks. Permuting to block form is closely connected to matrix
reducibility. $A$ is said to be reducible if there is a permutation matrix $P$ such that


$$PAP^T = \begin{pmatrix} A_{p_1,p_1} & A_{p_1,p_2} \\ 0 & A_{p_2,p_2} \end{pmatrix},$$


where $A_{p_1,p_1}$ and $A_{p_2,p_2}$ are nontrivial square matrices (that is, they are of order at
least 1). If $A$ is not reducible, it is irreducible. If $A$ is structurally symmetric, then
$A_{p_1,p_2} = 0$ and $PAP^T$ is block diagonal. The following example illustrates that a
one-sided permutation can transform an irreducible matrix $A$ into a reducible matrix
$AQ$.


1 1 1 1 
A=]|{1 1 , Q= 1 , AQ= 1 1 
1 


A matrix $A$ is said to be a Hall matrix (or has the Hall property) if every set of $k$
columns has nonzeros in at least $k$ rows ($1 \le k \le n$). $A$ is a strong Hall matrix (or
has the strong Hall property) if every set of $k$ columns ($1 \le k < n$) has nonzeros
in at least $k + 1$ rows. The strong Hall property trivially implies the Hall property.
The Hall property applies to rectangular $m \times n$ matrices with $m \ge n$. If $A$ is square
with a zero-free diagonal, then $A$ has the strong Hall property if and only if the
directed graph $G(A)$ is strongly connected.
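For very small patterns, the two definitions can be checked literally by enumerating column subsets. The following brute-force sketch is only meant to make the definitions concrete; its cost is exponential in the number of columns, and the function name `hall_properties` and the column-to-row-set representation are our own assumptions.

```python
from itertools import combinations


def hall_properties(col_rows):
    """col_rows[j] is the set of row indices holding entries in column j.

    Returns (is_hall, is_strong_hall) by testing every subset of columns.
    """
    n = len(col_rows)
    is_hall = True
    is_strong = True
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            rows = set().union(*(col_rows[j] for j in cols))
            if len(rows) < k:
                is_hall = False
            if k < n and len(rows) < k + 1:
                is_strong = False
    return is_hall, is_strong


# Hypothetical 2 x 2 patterns, given column by column.
upper_triangular = [{0}, {0, 1}]     # [[x, x], [0, x]]
full = [{0, 1}, {0, 1}]              # [[x, x], [x, x]]
empty_column = [set(), {0, 1}]       # first column has no entries
```

The upper triangular pattern has the Hall property but not the strong Hall property, because its first column alone has entries in only one row.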

The following theorem is an important consequence of reducibility. 


Theorem 3.3 (Brualdi & Ryser 1991) 
Given a nonsingular nonsymmetric matrix $A$, there exists a permutation matrix $P$
such that

$$PAP^T = \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,nb} \\ 0 & A_{2,2} & \cdots & A_{2,nb} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{nb,nb} \end{pmatrix}, \qquad (3.5)$$

where the square matrices $A_{ib,ib}$ on the diagonal are irreducible. The set
$\{A_{ib,ib} \mid 1 \le ib \le nb\}$ is uniquely determined (but the blocks may appear on
the diagonal in a different order). The order of the rows and columns within each
$A_{ib,ib}$ may not be unique.


Figure 3.5 The sparsity patterns of $A$ (left) and the upper block triangular form $PAP^T$ with two
blocks $A_{ib,ib}$, $ib = 1, 2$, of orders 2 and 4 (right).


The upper block triangular form (3.5) is also known as the Frobenius normal 
form. It is said to be nontrivial if nb > 1, and this is the case if A does not have the 
strong Hall property. An example of a matrix that can be symmetrically permuted 
to block triangular form with nb = 2 is given in Figure 3.5. 

In practice, many of the blocks in (3.5) are either sparse or zero blocks. Assuming
the blocks $A_{ib,ib}$ on the diagonal are all nonsingular, an LU factorization of each can
be computed independently. These can then be used to solve the permuted system
$PAP^T y = c$ as a sequence of $nb$ smaller problems, as outlined in Algorithm 3.5.
The solution of the original system $Ax = b$ follows by setting $c = Pb$ and $x =
P^T y$. Because the algorithms used to transform $A$ into a block triangular form are
typically graph-based (and do not use the numerical values of the entries of A), 
pivoting needs to be incorporated within the factorization of the diagonal blocks. 
Algorithm 3.5 employs partial pivoting for this. 
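The block back substitution of Algorithm 3.5 can be sketched in a few lines of Python. This is a simplified illustration under several assumptions of our own: the matrix is held as a dense list of lists, each diagonal block is solved by a small Gaussian elimination with partial pivoting (`solve_dense`) instead of storing sparse LU factors, and the function names are hypothetical.

```python
def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting on a small dense block."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))  # pivot row
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            m = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= m * aug[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def solve_block_upper_triangular(A, sizes, c):
    """Solve A y = c, where A is block upper triangular and sizes lists
    the orders of its diagonal blocks from top left to bottom right."""
    starts = [sum(sizes[:ib]) for ib in range(len(sizes))]
    y = [0.0] * len(c)
    for ib in range(len(sizes) - 1, -1, -1):       # last block first
        lo, hi = starts[ib], starts[ib] + sizes[ib]
        # Subtract the contributions of the components already computed.
        rhs = [c[i] - sum(A[i][j] * y[j] for j in range(hi, len(c)))
               for i in range(lo, hi)]
        block = [row[lo:hi] for row in A[lo:hi]]
        y[lo:hi] = solve_dense(block, rhs)
    return y


# Hypothetical example with diagonal blocks of orders 2 and 1.
A = [[2.0, 1.0, 1.0],
     [1.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
c = [7.0, 13.0, 12.0]
y = solve_block_upper_triangular(A, [2, 1], c)
```

In a sparse solver the diagonal blocks would be factorized once and the off-diagonal updates would be sparse matrix-vector products that are skipped for zero blocks; the dense loops here are only for brevity.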

The transversal of a matrix A is the set of its nonzero diagonal elements. A 
has a full or maximum transversal if all its diagonal entries are nonzero. There 
exist permutation matrices P and Q such that PAQ has a full transversal matrix 
if and only if A has the Hall property. Moreover, if A is nonsingular, then it can 
be nonsymmetrically permuted to have a full transversal. However, the converse 
is clearly not true (for example, a matrix with all its entries equal to one has a 
full transversal, but it is singular). Permuting A to have a full transversal will be 
discussed in Section 6.3. 

ALGORITHM 3.5 Solve a sparse linear system in upper block triangular form

Input: Upper block triangular matrix (3.5) and a conformally partitioned right-hand
side vector $c$.

Output: The conformally partitioned solution vector $y$.


1: for $ib = 1 : nb$ do    ▷ LU factorizations of the $A_{ib,ib}$ blocks can be performed in parallel
2:     Compute $P_{ib} A_{ib,ib} = L_{ib} U_{ib}$    ▷ Sparse LU factorization with partial pivoting
3: end for
4: Solve $L_{nb} U_{nb} y_{nb} = P_{nb} c_{nb}$    ▷ Perform forward and back substitutions
5: for $ib = nb - 1 : 1$ do
6:     for $jb = ib + 1 : nb$ do
7:         $c_{ib} = c_{ib} - A_{ib,jb} y_{jb}$    ▷ Sparse matrix-vector operation (skip if $A_{ib,jb} = 0$)
8:     end for
9:     Solve $L_{ib} U_{ib} y_{ib} = P_{ib} c_{ib}$    ▷ Perform forward and back substitutions
10: end for


If $A$ has a full transversal, then there exists a permutation matrix $P_s$ such
that $P_s A P_s^T$ has the form (3.5). In other words, once $A$ has a full transversal, a
symmetric permutation is sufficient to obtain the form (3.5). Finding $P_s$ is identical
to finding the strongly connected components (SCCs) of the digraph $G(A) = (V, E)$
(Section 2.3). To find the SCCs, $V$ is partitioned into non-empty subsets $V_i$ with
each vertex belonging to exactly one subset. Each vertex $i$ in the quotient graph
corresponds to a subset $V_i$, and there is an edge in the quotient graph with endpoints
$i$ and $j$ if $E$ contains at least one edge with one endpoint in $V_i$ and the other in $V_j$.
The condensation (or component graph) of a digraph is a quotient graph in which
the SCCs form the subsets of the partition, that is, each SCC is contracted to a
single vertex. This reduction provides a simplified view of the connectivity between
components. An example is given in Figure 3.6. It has five SCCs: {p, q, r}, {s, t, u},
{v}, {w}, and {x}.
The following result gives the relationship between SCCs and DAGs. 


Theorem 3.4 (Sharir 1981; Cormen et al. 2009) 
The condensation Gc of a digraph is a DAG (directed acyclic graph). 


Because any DAG can be topologically ordered, $G_C = (V_C, E_C)$ can be
topologically ordered, and if $V_i$ and $V_j$ are contracted to $s_i$ and $s_j$ and $(s_i \to s_j)
\in E_C$, then $s_i < s_j$. It follows that to permute $A$ to block triangular form it is
sufficient to find the SCCs of $G(A)$. That is, a topological ordering of the vertices of
the condensation $G_C$ induced by the SCCs gives the block triangular form. There are
many ways to find SCCs, one of which is Tarjan’s
algorithm (Algorithm 3.6). The key idea here is that vertices of an SCC form a
subtree in the DFS spanning tree of the graph. The algorithm performs depth-first
searches, keeping track of two properties for each vertex $v$: when $v$ was first
encountered (held in invorder(v)) and the lowest numbered vertex that is reachable
from $v$ (called the low-link value and held in lowlink(v)). It pushes vertices onto
a stack as it goes and outputs an SCC when it finds a vertex for which invorder(v)
and lowlink(v) are the same. The value lowlink(v) is computed during the DFS
from $v$, as this finds the vertices that are reachable from $v$.


Figure 3.6 An illustration of the strong components of a digraph. On the left, the five SCCs are
denoted using different colours and on the right is the condensation DAG $G_C$ formed by the SCCs.

In Algorithm 3.6, the variable index is the DFS vertex number counter that 
is incremented when an unvisited vertex is visited. S is the vertex stack. It is 
initially empty and is used to store the history of visited vertices that are not yet 
committed to an SCC. Vertices are added to the stack in the order in which they 
are visited. The outermost loop of the algorithm visits each vertex that has not 
yet been visited, ensuring vertices that are not reachable from the starting vertex 
are eventually visited. The recursive function scomp_step performs a single DFS, 
finding all descendants of vertex v, and reporting all SCCs for that subgraph. When 
a vertex $v$ finishes recursing, if lowlink(v) = invorder(v), then it is the root vertex
of an SCC comprising all of the vertices above it on the stack. The algorithm pops
the stack up to and including $v$; these popped vertices form an SCC. The algorithm
is linear in the number of edges and vertices, that is, it is of complexity $O(|V| + |E|)$.
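A compact recursive Python sketch of Tarjan's algorithm is given below. It follows the description above, with invorder, lowlink, and the vertex stack held in ordinary dictionaries and lists; the function name `strongly_connected_components`, the auxiliary set `on_stack`, and the 6-vertex digraph are our own illustrative choices (the digraph is not the one in Figure 3.6).

```python
def strongly_connected_components(graph):
    """graph maps each vertex to a list of its out-neighbours.
    Returns the SCCs as lists of vertices, in the order they are found."""
    invorder, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def scomp_step(v):
        nonlocal counter
        counter += 1
        invorder[v] = lowlink[v] = counter
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in invorder:            # w not yet visited: recurse
                scomp_step(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:              # w belongs to the current SCC
                lowlink[v] = min(lowlink[v], invorder[w])
        if lowlink[v] == invorder[v]:        # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.remove(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in invorder:
            scomp_step(v)
    return sccs


# Hypothetical digraph with SCCs {1, 2, 3}, {4, 5} and {6}.
graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: []}
sccs = strongly_connected_components(graph)
```

Recursion keeps the sketch close to the pseudocode, but for large graphs an explicit stack would be needed to avoid exceeding Python's recursion limit.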


ALGORITHM 3.6 Tarjan’s algorithm to find the strongly connected components (SCCs) of a digraph

Input: Digraph $G = (V, E)$.

Output: Strongly connected components of $G$, determined one-by-one.


1: $V_v = \emptyset$, $S = \emptyset$, index = 0    ▷ Each vertex is initially unvisited
2: for each $v \in V$ do
3:     if $v \notin V_v$ then
4:         scomp_step(v)
5:     end if
6: end for
7: recursive function (scomp_step(v))
8:     $V_v = V_v \cup \{v\}$    ▷ Add v to the set of visited vertices
9:     index = index + 1    ▷ Set the index for v to smallest unused index
10:     invorder(v) = index, lowlink(v) = index
11:     push(S, v)    ▷ Put v on the stack
12:     Set v = head(S)    ▷ v is the current head of S
13:     for each $(v \to w) \in E$ do    ▷ Look in the adjacency list of v
14:         if $w \notin V_v$ then    ▷ w not yet been visited; recurse on it
15:             scomp_step(w)
16:             lowlink(v) = min(lowlink(v), lowlink(w))
17:         else if $w \in S$ then    ▷ w is in the stack and hence in current SCC
18:             lowlink(v) = min(lowlink(v), invorder(w))
19:         end if
20:     end for
21:     if lowlink(v) = invorder(v) then
22:         pop all vertices down to v from S to obtain a new SCC
23:     end if
24: end recursive function


3.5 Block Partitioning


In this section, we assume that S{A} is symmetric and $G = (V, E)$ is the adjacency
graph of $A$.


3.5.1 Block Structure Based on Supervariables 


Sets of columns of A frequently have identical sparsity patterns. For instance, when 
A arises from a finite element discretization, the columns corresponding to variables 
that belong to the same set of finite elements have the same pattern, and this occurs 
as a result of each node of the finite element mesh having multiple degrees of 
freedom associated with it. This repetition of the sparsity patterns can be used to 
substantially enhance performance. 

Adjacent vertices $u$ and $v$ in an undirected graph $G = (V, E)$ are said to be
indistinguishable if they have the same neighbours, that is, $\mathrm{adj}_G\{u\} \cup \{u\} =
\mathrm{adj}_G\{v\} \cup \{v\}$. A set of mutually indistinguishable vertices is called an
indistinguishable vertex set. If $U \subseteq V$ is an indistinguishable vertex set, then $U$ is maximal
if $U \cup \{w\}$ is not indistinguishable for any $w \in V \setminus U$.

Indistinguishability is an equivalence relation on $V$, and maximal indistinguishable
vertex sets represent its classes. This implies a partitioning of $V$ into $nsup \ge 1$
non-empty disjoint subsets


$V = V_1 \cup V_2 \cup \ldots \cup V_{nsup}$. (3.6)


An indistinguishable vertex set can be represented by a single vertex, called a 
supervariable. 

If the vertices belonging to each subset $V_1, \ldots, V_{nsup}$ are numbered consecutively,
with those in $V_i$ preceding those in $V_{i+1}$ ($1 \le i < nsup$), and if $P$ is the
permutation matrix corresponding to this ordering, then the permuted matrix $PAP^T$
has a block structure in which the blocks are dense (with the possible exception of
the diagonal entries, which can be zero); the dimensions of the blocks are equal to
the sizes of the indistinguishable sets.

One approach for identifying supervariables is outlined in Algorithm 3.7. 
Initially, all the vertices are placed in a single vertex set (that is, into a single 
supervariable). This is split into two supervariables by taking the first vertex 
j = 1 and moving vertices in the adjacency set of j into a new vertex set (a 
new supervariable). Each vertex $j$ is considered in turn, and each vertex set $V_{sv}$
that contains a vertex in $\mathrm{adj}_G\{j\} \cup \{j\}$ is split into two by moving the vertices in
$\mathrm{adj}_G\{j\} \cup \{j\}$ that belong to $V_{sv}$ into a new vertex set. Note that as a result of the
splitting and moving of vertices, a vertex set can become empty, in which case it
is discarded. Once the supervariables have been determined, the permuted matrix
$PAP^T$ can be condensed to a matrix of order equal to $nsup$; the corresponding
graph is called the supervariable graph. If the average number of variables in each
supervariable is $k$, using the supervariable graph will reduce the amount of integer
data that is read during the symbolic phase by a factor of about $k^2$.

As an illustration, consider the following $5 \times 5$ matrix


      1  2  3  4  5
  1   *  *        *
  2   *  *        *
  3         *  *  *
  4         *  *  *
  5   *  *  *  *  *

Initially, 1, 2, 3, 4, 5 are put into a single vertex set $V_1$. Consider $j = 1$. Vertices
$i = 1$, 2 and 5 belong to $\mathrm{adj}_G\{1\} \cup \{1\}$; they are moved from $V_1$ into a new vertex set.
There is no further splitting of the vertex sets for $j = 2$. For $j = 3$, $\mathrm{adj}_G\{3\} \cup \{3\} =
\{3, 4, 5\}$. Vertices $i = 3$ and 4 are moved from $V_1$ into a new vertex set. $V_1$ is now
empty and can be discarded. Vertex $i = 5$ is moved from the vertex set that holds
vertices 1 and 2 into a new vertex set. For $j = 4$ and 5, no additional splitting is
performed. Thus, three supervariables are found, namely {1, 2}, {3, 4}, and {5}.




ALGORITHM 3.7 Find the supervariables of an undirected graph 
Input: Graph G of a symmetrically structured matrix. 
Output: Partitioning of V into indistinguishable vertex sets. 


1: $V_1 = \{1, 2, \ldots, n\}$
2: for $j = 1 : n$ do
3:     for $i \in \mathrm{adj}_G\{j\} \cup \{j\}$ do
4:         Find $sv$ such that $i \in V_{sv}$
5:         if this is the first occurrence of $sv$ for the current index $j$ then
6:             Establish a new vertex set $V_{nsv}$ and move $i$ from $V_{sv}$ to $V_{nsv}$
7:         else
8:             Move $i$ from $V_{sv}$ to $V_{nsv}$
9:         end if
10:         Discard $V_{sv}$ if it is empty
11:     end for
12: end for
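The splitting procedure of Algorithm 3.7 can be sketched in Python as follows. The dictionaries `set_of` and `members`, the per-vertex map `split_target` that records the new set created for each old set, and the function name are our own assumptions; the test graph is the adjacency graph implied by the 5 × 5 illustration, in which vertices 1, 2, 5 and vertices 3, 4, 5 form the two adjacency patterns.

```python
def find_supervariables(adj):
    """Partition the vertices into indistinguishable vertex sets.

    adj[j] is the set of neighbours of vertex j (j itself excluded).
    """
    set_of = {i: 0 for i in adj}          # vertex -> id of its vertex set
    members = {0: set(adj)}               # id -> vertices currently in the set
    next_id = 1
    for j in sorted(adj):
        split_target = {}                 # old set id -> new set id for this j
        for i in sorted(adj[j] | {j}):
            sv = set_of[i]
            if sv not in split_target:    # first occurrence of sv for this j
                split_target[sv] = next_id
                members[next_id] = set()
                next_id += 1
            nsv = split_target[sv]
            members[sv].remove(i)         # move i from V_sv to V_nsv
            members[nsv].add(i)
            set_of[i] = nsv
            if not members[sv]:
                del members[sv]           # discard an emptied vertex set
    return sorted(sorted(s) for s in members.values())


# Adjacency graph implied by the 5 x 5 illustration in the text.
adj = {1: {2, 5}, 2: {1, 5}, 3: {4, 5}, 4: {3, 5}, 5: {1, 2, 3, 4}}
supervariables = find_supervariables(adj)
```

Each vertex is moved once per entry of the matrix, so the work is proportional to the number of entries plus n, provided that set membership updates take constant time.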


3.5.2 Block Structure Using Symbolic Dot Products 


An alternative way to find a block structure uses symbolic dot products between the 
rows of the matrix. While fully dense blocks can be found this way, it can also be 
used to determine an approximate block structure in which blocks are classified as 
dense or sparse based on a chosen threshold; this can be useful in preconditioning 
iterative methods. Although we assume that S{A} is symmetric, modifications can 
extend the approach to general nonsymmetric A. 

Rewrite $A$ as row vectors

$A = (a_1^T, \ldots, a_n^T)^T$, where $a_i^T = A_{i,1:n}$,

and consider $G(A) = (V, E)$. A partition $V = V_1 \cup \ldots \cup V_{nb}$ is constructed
using row products $a_i^T a_k$ between different rows of $A$. These express the level
of orthogonality between the rows; if $a_i^T a_k$ is small, then $i$ and $k$ are assigned to
different vertex sets. Algorithm 3.8 treats all entries of $A$ as unity, and the symbolic
row products can be considered as a generalization of the angles between rows
expressed by their cosines, hence the notation cosine for the vector that stores
these products. The vertex sets are described using the vector adjmap. On output,
if adjmap($i_1$) = adjmap($i_2$), then vertices $i_1$ and $i_2$ belong to the same vertex
set. Symmetry of S{A} simplifies the computation of the symbolic row products
because for row $i$ only $k > i$ is considered, that is, only the symbolic row products
that correspond to one triangle of $A^T A$ are checked.

ALGORITHM 3.8 Find approximately indistinguishable vertex sets in an undirected graph

Input: Graph $G = (V, E)$ of a symmetrically structured matrix $A$, the number $nz_i$
of entries in row $i$ of $A$ ($1 \le i \le n$), and a threshold parameter $\tau \in (0, 1]$.

Output: Partitioning of $V$ into $nb$ disjoint approximately indistinguishable vertex
sets.


1: nb = 0, adjmap(1 : n) = 0, cosine(1 : n) = 0
2: for $i = 1 : n$ do
3:     if adjmap(i) = 0 then
4:         nb = nb + 1    ▷ Start a new set
5:         adjmap(i) = nb
6:         for $(i, j) \in E$ do    ▷ Corresponds to an entry in $A_{i,1:n}$
7:             for $(k, j) \in E$ with $k > i$ do    ▷ Both rows i and k have an entry in column j
8:                 if adjmap(k) = 0 then    ▷ k has not yet been added to some partitioning set
9:                     cosine(k) = cosine(k) + 1    ▷ Increase partial dot product
10:                 end if
11:             end for
12:         end for
13:         for k with cosine(k) $\neq$ 0 do
14:             if cosine(k)$^2 \ge \tau^2 \cdot nz_i \cdot nz_k$ then    ▷ Test similarity of row patterns
15:                 adjmap(k) = nb
16:             end if
17:             cosine(k) = 0
18:         end for
19:     end if
20: end for


The procedure outlined in Algorithm 3.8 and illustrated in Figure 3.7 is controlled
by a threshold parameter $\tau \in (0, 1]$. A vertex $k$ is added to the subset to which $i$
belongs if the cosine of the angle between rows $i$ and $k$ is at least $\tau$. If $\tau < 1$, the block
structure depends on the order in which the rows are processed, while $\tau = 1$ gives
the exact indistinguishable vertex sets because, in this case, the row patterns being
compared must be identical for the rows to be assigned to the same set.
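The following Python sketch illustrates the symbolic dot product approach of Algorithm 3.8 under some assumptions of our own: rows are given as sets of column indices (0-based, diagonal included), the symbolic dot products for row i are accumulated over all of its entries before the threshold test is applied, and the test accepts row k when the squared dot product is at least the squared threshold times the product of the two row counts. The function name `approximate_blocks` and the small pattern are hypothetical.

```python
def approximate_blocks(rows, tau):
    """rows[i] is the set of column indices with entries in row i of a
    symmetrically structured matrix; 0 < tau <= 1.
    Returns adjmap: equal values mark vertices in the same vertex set."""
    n = len(rows)
    cols = [set() for _ in range(n)]      # column j -> rows with an entry
    for i, row in enumerate(rows):
        for j in row:
            cols[j].add(i)
    adjmap = [0] * n
    nb = 0
    for i in range(n):
        if adjmap[i] != 0:
            continue
        nb += 1                           # start a new set
        adjmap[i] = nb
        cosine = {}
        for j in rows[i]:
            for k in cols[j]:
                if k > i and adjmap[k] == 0:
                    cosine[k] = cosine.get(k, 0) + 1   # symbolic dot product
        for k, dot in cosine.items():
            # Compare squared cosine with tau^2 to avoid square roots.
            if dot * dot >= tau * tau * len(rows[i]) * len(rows[k]):
                adjmap[k] = nb
    return adjmap


# Hypothetical 5 x 5 symmetric pattern (0-based indices).
rows = [{0, 1, 4}, {0, 1, 4}, {2, 3, 4}, {2, 3, 4}, {0, 1, 2, 3, 4}]
exact = approximate_blocks(rows, 1.0)
relaxed = approximate_blocks(rows, 0.7)
```

With the threshold equal to 1 only rows with identical patterns are grouped, whereas the relaxed threshold also places the last row in the first set, giving an approximate block structure.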


Figure 3.7 An example to illustrate Algorithm 3.8. The original matrix is given (left) together
with the permuted matrix with indistinguishable vertex sets $V = \{1, 3\} \cup \{2, 6\} \cup \{4\} \cup \{5\}$ obtained
using $\tau = 1$ (centre) and the permuted matrix with approximately indistinguishable vertex sets
$V = \{1, 3, 5\} \cup \{2, 6\} \cup \{4\}$ obtained using $\tau = 0.5$ (right). The threshold $\tau = 0.5$ results in putting
row 5 into the same set as row 1, making the vertex sets only approximately indistinguishable. The
permuted matrix on the right has an approximate block form.


3.6 Notes and References


A standard description of LU factorizations based on the generic scheme given in
Algorithm 3.1 can be found in the classical book by Ortega (1988b); this includes the
symmetric case and discusses early parallelization issues (which are also considered
in the review of Dongarra et al. (1984)). A more algorithmically oriented approach is 
given in Golub & Van Loan (1996). For the column variant with partial pivoting, we 
recommend the detailed description of the sparse case in Gilbert & Peierls (1988). 
Many results for sparse LU factorizations are surveyed by Gilbert & Ng (1993) and 
Gilbert (1994). Pothen & Toledo (2004) consider both symmetric and nonsymmetric 
matrices in their survey of graph models of sparse elimination. The review by Davis 
et al. (2016) provides many further references. 

Parter (1961) presents Parter’s rule, and its nonsymmetric version is given in 
Haskins & Rose (1973). Building on the paper of Rose et al. (1976), Rose & Tarjan 
(1978) were the first to methodically consider the symbolic structure of Gaussian 
elimination for nonsymmetric matrices. Related work is included in the seminal 
paper on Cholesky factorizations by Liu (1986). Fill-in rules in the general context 
of Schur complements in LU factorizations can be found in Eisenstat & Liu (1993b). 

Classical and detailed treatments of triangular solves that also cover sparse issues 
are given in the papers Brayton et al. (1970), Gilbert & Peierls (1988), and Gilbert 
(1994). For reducibility theory that is closely connected to the general theory of 
matrices, see Brualdi & Ryser (1991), which includes, for example, a proof of 
Theorem 3.4. 

Algorithm 3.6 for computing strongly connected components of a digraph is 
introduced in Tarjan (1972); see also Sharir (1981) and Duff & Reid (1978) for 
an early implementation. 

For identifying supervariables, Algorithm 3.7 follows Reid & Scott (1999), but 
see also Ashcraft (1995) and Hogg & Scott (2013a) (the latter presents an efficient 
variant that employs a stack). The approximate block partitioning of Section 3.5.2 
is from the paper by Saad (2003a), which also describes some modifications of the 
basic approach; more sophisticated schemes with overlapping blocks are given in 
Fritzsche et al. (2013).


Chapter 4


Sparse Cholesky Solver: The Symbolic Phase


The modern view of numerical linear algebra as being to a large 
extent the study and systematic use of matrix decompositions 
has certainly been influenced by Cholesky’s posthumously 
published work — Benzi (2017). 


This chapter focuses on the symbolic phase of a sparse Cholesky solver. The sparsity 
pattern S{ A} of the symmetric positive definite (SPD) matrix A is used to determine 
the nonzero structure of the Cholesky factor L without computing the numerical 
values of the nonzeros. The subsequent numerical factorization is discussed in the 
next chapter. Because the symbolic phase works only with S{A} (the values of the 
entries of A are not considered), it is also used for symmetric indefinite matrices 
and sometimes within LU factorizations of symmetrically structured nonsymmetric 
problems. It is implicitly assumed that all the diagonal entries of A are included in 
S{A} (even if they are zero). During the factorization phase, it may be necessary to 
amend the data structures to allow for indefiniteness. This makes the factorization of 
indefinite matrices potentially more expensive and more complex; this is considered 
further in Chapter 7. 

A fundamental difference between dense and sparse Cholesky factorizations is 
that, in the latter, each column of L depends on only a subset of the previous 
columns. The elimination tree is a data structure that describes the dependencies 
among the columns of A during its factorization. A key result that assists in 
the understanding of sparse Cholesky factorizations is that the sparsity pattern of 
column j of L is the union of the pattern of column j of the lower triangular part 
of A and the patterns of the children of j in the elimination tree; this is shown in 
Section 4.3. Furthermore, the fact that disjoint parts of the elimination tree can be 
factored independently offers the potential for high-level tree-based parallelism that 
does not exist for dense matrices.


4.1 Column Replication Principle


We begin by looking at how the sparsity pattern of a computed column of L 
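influences the patterns of subsequent Schur complements. From (3.2), the Schur
complement $S^{(k)}$ can be written as

$$S^{(k)} = A_{k:n,k:n} - \sum_{j=1}^{k-1} \begin{pmatrix} l_{kj} \\ \vdots \\ l_{nj} \end{pmatrix} \begin{pmatrix} l_{kj} & \cdots & l_{nj} \end{pmatrix}. \qquad (4.1)$$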


Consider column $j$ of $L$ ($1 \le j \le k - 1$), and let $l_{ij} \neq 0$ for some $i > j$. The
involvement of $l_{ij}$ in the outer product in (4.1) allows the following observation.


Observation 4.1 For any $i > j \ge 1$ such that $l_{ij} \neq 0$,
$\mathcal{S}\{L_{i:n,j}\} \subseteq \mathcal{S}\{L_{i:n,i}\}$. (4.2)


This is called the column replication principle because the pattern of column j of 
L (rows i to n) is replicated in the pattern of column i of L. 


Denote the row index of the first subdiagonal nonzero entry in column j of L by 
parent (j), that is, 


$\mathrm{parent}(j) = \min\{i \mid i > j \text{ and } l_{ij} \neq 0\}$. (4.3)


If there is no such entry, set parent(j) = 0. The row index parent(parent(j)) is
denoted by $\mathrm{parent}^2(j)$, and so on. Applying column replication recursively implies
the sparsity pattern of column $j$ of $L$ is replicated in that of column parent(j),
which in turn is replicated in the pattern of column $\mathrm{parent}^2(j)$, and so on. This
is illustrated in Figure 4.1. Here $j = 1$, and because the first subdiagonal entry in
column 1 is in row 3, parent(1) = 3. Likewise, parent(3) = $\mathrm{parent}^2(1)$ = 5.
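To make the definition (4.3) concrete, the sketch below computes the subdiagonal pattern of each column of $L$ by applying Parter's rule in the natural order and then reads off the parent of each column. This is a simple illustrative approach and not an efficient symbolic factorization; the function names and the 7 × 7 pattern are hypothetical (the pattern is not the one shown in Figure 4.1, although it also has parent(1) = 3 and parent(3) = 5).

```python
def symbolic_cholesky(lower):
    """lower[j] is the set of row indices i > j with a_ij != 0 (1-based).

    Returns the subdiagonal pattern of each column of L, obtained by
    eliminating the vertices of G(A) in the natural order.
    """
    n = len(lower)
    adj = {v: set() for v in range(1, n + 1)}
    for j, rows in lower.items():
        for i in rows:
            adj[i].add(j)
            adj[j].add(i)
    pattern = {}
    for k in range(1, n + 1):
        nbrs = {i for i in adj[k] if i > k}   # uneliminated neighbours of k
        pattern[k] = nbrs
        for u in nbrs:                        # neighbours become a clique
            adj[u] |= nbrs - {u}
    return pattern


def parent_vector(pattern):
    """parent(j) is the first subdiagonal row index in column j, else 0."""
    return {j: min(rows) if rows else 0 for j, rows in pattern.items()}


# Hypothetical strictly lower triangular pattern of a 7 x 7 matrix A.
lower = {1: {3, 6}, 2: {4}, 3: {5}, 4: {7}, 5: {7}, 6: {7}, 7: set()}
pattern = symbolic_cholesky(lower)
parent = parent_vector(pattern)

# Column replication (4.2): rows below i in column j also appear in column i.
replication_holds = all(
    {r for r in pattern[j] if r > i} <= pattern[i]
    for j in pattern for i in pattern[j]
)
```

In this example the entry in row 6 of column 1 is replicated in column 3 and then in column 5, which is how the filled entries in rows 6 of those columns arise.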


Figure 4.1 An illustration of column replication. On the left are the entries in $L$ before step 1 of a
Cholesky factorization (that is, the entries in the lower triangular part of $A$); in the centre, we show
the replication of the nonzeros from column 1 in the pattern of column parent(1) = 3 (red entries
f); on the right, we show the subsequent replication in column $\mathrm{parent}^2(1) = 5$.


The following result shows that, provided $A$ is irreducible, the mapping
parent(j) has nonzero values given by (4.3) for all $j < n$.


Theorem 4.1 (Liu 1986) If $A$ is SPD and irreducible, then in each column $j$ ($1 \le
j < n$) of its Cholesky factor $L$ there exists an entry $l_{ij} \neq 0$ with $i > j$.


Proof From Parter’s rule, each step of the Cholesky factorization corresponds to
adding new edges into the graph of the corresponding Schur complement. If $A$ is
irreducible, then the graphs corresponding to the Schur complements are connected.
Consequently, for any vertex $j$ ($1 \le j < n$) in any of these graphs, there is at least
one vertex $i$ with $i > j$ to which $j$ is connected. This corresponds to the nonzero
entry in column $j$ of $L$. □


With the convention $\mathrm{parent}^1(j) = \mathrm{parent}(j)$, the next theorem shows that
if entry $l_{ij}$ of $L$ is nonzero, then $\mathrm{parent}^t(j) = i$ for some $t \ge 1$, and there
is an entry in row $i$ of $L$ in each of the columns in the replication sequence
$j, \mathrm{parent}^1(j), \mathrm{parent}^2(j), \ldots, \mathrm{parent}^t(j)$.


Theorem 4.2 (Liu 1990; George 1998) Let $A$ be SPD, and let $L$ be its Cholesky
factor. If $l_{ij} \neq 0$ for some $j < i \le n$, then there exists $t \ge 1$ such that $\mathrm{parent}^t(j) =
i$ and $l_{ik} \neq 0$ for $k = j, \mathrm{parent}^1(j), \mathrm{parent}^2(j), \ldots, \mathrm{parent}^t(j)$.


Proof If $i = \mathrm{parent}^1(j)$, the result is immediate. Otherwise, there exists an index
$k$, $j < k < i$, of a subdiagonal entry in column $j$ of $L$ such that $k = \mathrm{parent}^1(j)$.
Column replication implies $l_{ik} \neq 0$. Applying an inductive argument to $l_{ik}$, the
result follows after a finite number of steps. □


If there is a sequence of nonzeros in a row of $L$, it is natural to ask where the
sequence begins. It is straightforward to see that if there is no $k$ ($1 \le k < i$) such that $a_{ik} \neq 0$,
no replication of nonzeros can start in row $i$. The main result on the replication of
nonzeros of $A$ is summarized as Theorem 4.3.


Theorem 4.3 (Liu 1986) Let $A$ be SPD, and let $L$ be its Cholesky factor. If $a_{ij} = 0$
for some $1 \le j < i \le n$, then there is a filled entry $l_{ij} \neq 0$ if and only if there exist
$k < j$ and $t \ge 1$ such that $a_{ik} \neq 0$ and $\mathrm{parent}^t(k) = j$.


4.2 Elimination Trees 


The discussion of column replication is significantly simplified using elimination 
trees. The elimination tree (or etree) $\mathcal{T}(A)$ (or simply $\mathcal{T}$) of an SPD matrix
has vertices $1, 2, \ldots, n$ and an edge between each pair $(j, \mathrm{parent}(j))$, where
parent(j) is given by (4.3); $j$ is a root vertex of the tree if parent(j) = 0. The
edges of $\mathcal{T}$ are considered to be directed from a child to its parent, that is,


$E(\mathcal{T}) = \{(j \to i) \mid i = \mathrm{parent}(j)\}$.


If $\mathcal{T}$ has a single component, then the root vertex is $n$. Despite the terminology, the
elimination tree need not be connected and in general is a forest. For simplicity,
in our discussions, we assume $\mathcal{T}$ has a single component, and we say that $\mathcal{T}$ is
described by the vector parent.

An example of a matrix and its elimination tree is given in Figure 4.2. Here and
elsewhere, following conventional notation, directional arrows are omitted from the
tree plot.


Figure 4.2 An illustration of a sparse matrix $A$ with a symmetric sparsity pattern and its
elimination tree $\mathcal{T}(A)$. The root vertex is 8. The filled entries in $\mathcal{S}\{L + L^T\}$ are denoted by f.

Concepts such as child, leaf, ancestor, and descendant vertices introduced in
Section 2.3 for directed rooted trees can be applied to $\mathcal{T}$. Additionally, $\mathrm{anc}_{\mathcal{T}}\{j\}$
and $\mathrm{desc}_{\mathcal{T}}\{j\}$ are defined to be the sets of ancestors and descendants of vertex j in
$\mathcal{T}$. We denote by $\mathcal{T}(j)$ the subtree of $\mathcal{T}$ induced by j and $\mathrm{desc}_{\mathcal{T}}\{j\}$; j is the root
vertex of $\mathcal{T}(j)$. The size $|\mathcal{T}(j)|$ is the number of vertices in the subtree. A pruned
subtree of $\mathcal{T}(j)$ is the connected subgraph induced by j and a subset of $\mathrm{desc}_{\mathcal{T}}\{j\}$.
That is, for any vertex i in a pruned subtree of $\mathcal{T}(j)$, all the ancestors of i in $\mathcal{T}(j)$
belong to the pruned subtree. A pruned subtree of $\mathcal{T}$ shares the mapping parent with $\mathcal{T}$.

The following observation is straightforward. 


Observation 4.2 If $i \in \mathrm{anc}_{\mathcal{T}}\{j\}$ for some $j \ne i$, then $i > j$.


The connection between the mapping parent and the sets of ancestors and
descendants is emphasized by the next observation.


Observation 4.3 If i and j are vertices of the elimination tree $\mathcal{T}$ with $j < i \le n$,
then


$i \in \mathrm{anc}_{\mathcal{T}}\{j\}$ if and only if $j \in \mathrm{desc}_{\mathcal{T}}\{i\}$ if and only if $\mathrm{parent}^t(j) = i$ for some $t \ge 1$.


The results in Section 4.1 can be expressed using rooted trees. Consider, for
example, Theorem 4.2. Instead of stating that there exists $t \ge 1$ such that
$\mathrm{parent}^t(j) = i$, we can write that $i \in \mathrm{anc}_{\mathcal{T}}\{j\}$. Rewriting Theorem 4.3 as the
following corollary provides a clear characterization of the sparsity patterns of the
rows of L.


Corollary 4.4 (Liu 1986) Consider the elimination tree $\mathcal{T}$ and the Cholesky factor
L of A. If i and j are vertices of $\mathcal{T}$ with $j < i \le n$ and $a_{ij} = 0$, then $l_{ij} \ne 0$ if and
only if there exists $k < j$ such that $j \in \mathrm{anc}_{\mathcal{T}}\{k\}$ and $a_{ik} \ne 0$.

Figure 4.3 The row subtree $\mathcal{T}_r(5)$ of the elimination tree $\mathcal{T}$ from Figure 4.2 (left). Vertex 3 has
been pruned because $a_{35} = 0$. The row subtree $\mathcal{T}_r(8)$ (right) differs from $\mathcal{T} = \mathcal{T}(A)$ because
vertex 1 has been pruned ($a_{18} = 0$).


The subtree of $\mathcal{T}$ with vertices that correspond to the nonzeros of row i of L is
called the i-th row subtree and is denoted by $\mathcal{T}_r(i)$. Formally, it is a pruned subtree
of $\mathcal{T}$ induced by the union of the vertex set


$\{i\} \cup \{ k \mid a_{ik} \ne 0 \text{ and } k < i \}$


with all vertices on the directed paths in $\mathcal{T}$ from k to i, that is, with all their ancestors
from $\mathcal{T}(i)$. The root vertex is i, and the leaf vertices are a subset of the column
indices in the i-th row of the lower triangular part of A. Figure 4.3 illustrates row
subtrees for the matrix and elimination tree from Figure 4.2. Note that row subtrees
are connected subgraphs of $\mathcal{T}$, even if $\mathcal{T}$ is not connected. If $\mathcal{T}$ can be found without
determining the pattern of L, then $\mathcal{T}_r(i)$ can be used to derive the sparsity pattern of
row i of L, without having to store each entry explicitly.

Theorem 4.5 characterizes the ancestors of a given vertex j using paths in $\mathcal{G}(A)$.
The proof helps clarify the relationship between $\mathcal{T}$ and paths in $\mathcal{G}(A)$.


Theorem 4.5 (Schreiber 1982; Liu 1986) If i and j are vertices in the elimination
tree $\mathcal{T}$ with $j < i \le n$, then $i \in \mathrm{anc}_{\mathcal{T}}\{j\}$ if and only if there exists a path


$j \overset{\mathcal{G}(A)}{\underset{\{1,\ldots,i\}}{\Longleftrightarrow}} i. \qquad (4.4)$


Proof Assume $i \in \mathrm{anc}_{\mathcal{T}}\{j\}$. Then there is a directed path from j to i in $\mathcal{T}$ of
length $l \ge 1$. Each edge of this path belongs to $\mathcal{G}(L)$ and corresponds either to an
edge in $\mathcal{G}(A)$ or to a fill-path in $\mathcal{G}(A)$. Connecting these paths together gives (4.4).

Conversely, if the path (4.4) exists, induction on its length can be used to prove
the result. If the path is of length 1, then the result holds because i and j are
connected in $\mathcal{G}(A)$ by an edge. Consequently, from Theorem 4.2, i is an ancestor
of j. Now assume that the result is true for all paths of length less than l ($l > 1$),
and consider a path of length l. Let m be the largest vertex on this path. If $m < j$,
then (4.4) is a fill-path connecting i and j and, therefore, $i \in \mathrm{anc}_{\mathcal{T}}\{j\}$. Otherwise,
for $m > j$, the assumption implies $i \in \mathrm{anc}_{\mathcal{T}}\{m\} \cup \{m\}$ and $m \in \mathrm{anc}_{\mathcal{T}}\{j\} \cup \{j\}$, that
is, $i \in \mathrm{anc}_{\mathcal{T}}\{j\}$. $\square$


Given a vertex j in $\mathcal{T}$, the following corollary indicates how to find parent(j)
(if it exists). If the set of ancestors of j is non-empty, then the lowest numbered one
is its parent.


Corollary 4.6 (Liu 1986, 1990) Vertex i is the parent of vertex j in $\mathcal{T}$ if and only if
i is the lowest numbered vertex satisfying $j < i \le n$ for which there is a path (4.4).


The existence of (4.4) is equivalent to requiring that i and j belong to the same
component of the graph $\mathcal{G}(A_{1:i,1:i})$ corresponding to the $i \times i$ principal leading
submatrix $A_{1:i,1:i}$ of A. Figure 4.4 depicts $\mathcal{G}(A)$ for the matrix A given in Figure 4.2.
Consider vertex 4. Its set of ancestors for which paths from Theorem 4.5 exist
comprises vertices 5, 6, and 8. Vertex 7 is not an ancestor of 4 because there is
no path from 7 to 4 in the graph $\mathcal{G}(A_{1:7,1:7})$. Among the ancestors of 4, vertex 5
fulfils the condition from Corollary 4.6 and is thus the parent of 4.
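Corollary 4.6 can be checked directly on a small example. The following Python sketch is deliberately naive and is not an efficient algorithm: for each vertex j it tries increasing i > j until i and j lie in the same component of the graph of the leading submatrix. The function names, the 0-based vertex labels (with -1 in place of 0 for "no parent"), and the 6 x 6 pattern are assumptions made for this illustration; the pattern is not the matrix of Figure 4.2.

```python
from collections import deque


def same_component(adj, j, i):
    """Return True if i and j are joined by a path using only vertices <= i."""
    seen = {j}
    queue = deque([j])
    while queue:
        v = queue.popleft()
        if v == i:
            return True
        for w in adj[v]:
            if w <= i and w not in seen:
                seen.add(w)
                queue.append(w)
    return False


def parent_by_paths(adj):
    """Parent vector via Corollary 4.6: the lowest i > j connected to j."""
    n = len(adj)
    parent = [-1] * n
    for j in range(n):
        for i in range(j + 1, n):
            if same_component(adj, j, i):
                parent[j] = i
                break
    return parent


# Hypothetical symmetric pattern given by its off-diagonal entries (0-based).
edges = [(2, 0), (3, 1), (4, 0), (4, 3), (5, 2), (5, 4)]
adj = [set() for _ in range(6)]
for r, c in edges:
    adj[r].add(c)
    adj[c].add(r)

print(parent_by_paths(adj))  # [2, 3, 4, 4, 5, -1]
```

The repeated graph searches make this far more expensive than the constructions discussed next; it is only meant to make the characterization concrete.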

$\mathcal{T} = \mathcal{T}(A)$ can be constructed by stepwise extensions of the elimination trees
of the principal leading submatrices of A. Assume we have $\mathcal{T}(A_{1:i-1,1:i-1})$ and we
want to construct $\mathcal{T}(A_{1:i,1:i})$. Initialize $\mathcal{T}(A_{1:i,1:i}) = \mathcal{T}(A_{1:i-1,1:i-1})$. If there are
no entries in row i of A to the left of the diagonal, then there is nothing to do,
and only an isolated vertex i is added. Otherwise, i is the root of the row subtree
$\mathcal{T}_r(i)$ and an ancestor of some vertex j in $\mathcal{T}$. The ancestors k of j with $k < i$ are
in $\mathcal{T}(A_{1:i-1,1:i-1})$. Because row subtrees are connected subgraphs of $\mathcal{T}$, a directed
path in $\mathcal{T}(A_{1:i,1:i})$ with $\mathrm{parent}^t(j) = i$ exists for some $t \ge 1$. The search for
this path starts from jroot = j and continues, while parent(jroot) $\ne$ 0 and
parent(jroot) $\ne$ i, using a sequence of assignments jroot = parent(jroot). It
terminates once parent(jroot) = 0, in which case i becomes the parent of jroot, or
once parent(jroot) = i, that is, i is found to have already been added when
tracing the path from another entry j' in row i. The construction of $\mathcal{T}$ is summarized
in Algorithm 4.1.

Figure 4.4 The graph $\mathcal{G}(A)$ of the matrix from Figure 4.2 illustrating Theorem 4.5 and
Corollary 4.6.


ALGORITHM 4.1 Construction of an elimination tree
Input: A with a symmetric sparsity pattern and its undirected graph G.
Output: Elimination tree T described by the vector parent.

 1: for i = 1 : n do                                  ▷ Loop over the rows of A
 2:   parent(i) = 0                                   ▷ Initialisation
 3:   for j ∈ adj_G{i} and j < i do                   ▷ Loop over the below diagonal entries in row i
 4:     jroot = j
 5:     while parent(jroot) ≠ 0 and parent(jroot) ≠ i do   ▷ Find the current root
 6:       jroot = parent(jroot)
 7:     end while
 8:     if parent(jroot) = 0 then
 9:       parent(jroot) = i                           ▷ Make i the parent of jroot
10:     end if
11:   end for
12: end for
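As a concrete companion to Algorithm 4.1, here is a minimal Python sketch. It assumes 0-based vertex labels, uses -1 rather than 0 to mark a root, and takes the pattern as a hypothetical list `lower_rows` in which `lower_rows[i]` holds the columns j < i with a nonzero $a_{ij}$; these conventions and the example pattern are choices made for illustration only.

```python
def elimination_tree(lower_rows):
    """Parent vector of the elimination tree (Algorithm 4.1, 0-based).

    lower_rows[i] lists the columns j < i with a nonzero a_ij.
    parent[j] == -1 marks a root (the text uses 0 with 1-based labels).
    """
    n = len(lower_rows)
    parent = [-1] * n
    for i in range(n):
        for j in lower_rows[i]:
            jroot = j
            # Climb from j until the current root, or a vertex already below i.
            while parent[jroot] != -1 and parent[jroot] != i:
                jroot = parent[jroot]
            if parent[jroot] == -1:
                parent[jroot] = i  # i becomes the parent of that root
    return parent


# Hypothetical 6 x 6 pattern: entries (2,0), (3,1), (4,0), (4,3), (5,2), (5,4).
lower_rows = [[], [], [0], [1], [0, 3], [2, 4]]
print(elimination_tree(lower_rows))  # [2, 3, 4, 4, 5, -1]
```

For this pattern, eliminating column 0 creates a fill entry in position (4, 2), which is why vertex 2 has parent 4 even though $a_{42} = 0$.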


The most expensive part of Algorithm 4.1 is the while loop that searches for
subtree roots. Because the directed path from j to its root $\mathrm{parent}^t(j)$ is unique,
shortcuts can be incorporated; this is called path compression. Having found
a directed path from j to k, subsequent searches can be made more efficient
by introducing a vector ancestor and setting ancestor(j) = k. The modified
algorithm is outlined in Algorithm 4.2. It maintains two structures using the current
values of parent and ancestor. The tree described by ancestor is termed the
virtual tree.

Figure 4.5 shows a matrix for which path compression makes constructing $\mathcal{T}$
significantly more efficient. For this example, $\mathcal{T}$ is determined by the mapping
parent(6) = 0; parent(i) = i + 1 for $i = 1, \ldots, 5$. The complexity of
Algorithm 4.1 is $O(n^2)$, but for this example the complexity of Algorithm 4.2 is $O(n)$.
Formally, the complexity of Algorithm 4.2 is $O(\mathrm{nz}(A) \log_2(n))$, where $\mathrm{nz}(A)$ is the
number of nonzeros of A, but the logarithmic factor is rarely reached. Additional
modifications can reduce the theoretical complexity to $O(\mathrm{nz}(A)\, g(\mathrm{nz}(A), n))$,
where $g(\mathrm{nz}(A), n)$ is a very slowly increasing function called the functional
inverse of Ackermann's function. This means that, in practice, the complexity of
constructing $\mathcal{T}$, and hence of obtaining an implicit representation of $\mathcal{S}\{L\}$, is close
to linear in $\mathrm{nz}(A)$ (which in general is much smaller than $\mathrm{nz}(L)$).

ALGORITHM 4.2 Construction of an elimination tree using path compression
Input: A with a symmetric sparsity pattern and its undirected graph G.
Output: Elimination tree T described by the vector parent.

 1: for i = 1 : n do                                  ▷ Loop over the rows of A
 2:   parent(i) = 0, ancestor(i) = 0                  ▷ Initialisation
 3:   for j ∈ adj_G{i} and j < i do                   ▷ Loop over the below diagonal entries in row i
 4:     jroot = j
 5:     while ancestor(jroot) ≠ 0 and ancestor(jroot) ≠ i do
 6:       l = ancestor(jroot)
 7:       ancestor(jroot) = i                         ▷ Path compression to accelerate future searches
 8:       jroot = l
 9:     end while
10:     if ancestor(jroot) = 0 then
11:       ancestor(jroot) = i and parent(jroot) = i
12:     end if
13:   end for
14: end for
Figure 4.5 A sparse matrix for which computing the elimination tree using Algorithm 4.2 is much
more efficient than using Algorithm 4.1.
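The effect of path compression can be observed by counting iterations of the while loop. The Python sketch below follows Algorithm 4.2 with 0-based labels and -1 in place of 0, and adds a step counter that is not part of the algorithm. The test pattern (a subdiagonal plus a full last row, built by the hypothetical helper `chain_with_dense_last_row`, which assumes n >= 2) is an assumption chosen because its elimination tree is a chain; it is not claimed to be the matrix of Figure 4.5.

```python
def elimination_tree_compressed(lower_rows):
    """Algorithm 4.2 (0-based); also returns the number of while-loop steps."""
    n = len(lower_rows)
    parent = [-1] * n
    ancestor = [-1] * n  # the virtual tree used for shortcuts
    steps = 0
    for i in range(n):
        for j in lower_rows[i]:
            jroot = j
            while ancestor[jroot] != -1 and ancestor[jroot] != i:
                nxt = ancestor[jroot]
                ancestor[jroot] = i  # path compression
                jroot = nxt
                steps += 1
            if ancestor[jroot] == -1:
                ancestor[jroot] = i
                parent[jroot] = i
    return parent, steps


def chain_with_dense_last_row(n):
    """Hypothetical pattern (n >= 2): a subdiagonal plus a full last row."""
    rows = [[]] + [[i - 1] for i in range(1, n - 1)]
    rows.append(list(range(n - 1)))
    return rows


parent, steps = elimination_tree_compressed(chain_with_dense_last_row(6))
print(parent, steps)  # [1, 2, 3, 4, 5, -1] 4
```

Tracing Algorithm 4.1 by hand on the same 6 x 6 pattern takes 10 steps of its while loop, because every entry of the last row climbs the chain again. For this family of patterns the counts grow like (n-1)(n-2)/2 without compression and n-2 with it.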


The following simple theorem states that there is no edge in $\mathcal{G}(L + L^T)$ between
vertices belonging to subtrees of $\mathcal{T}$ with disjoint vertex sets. If there were such an
edge (s, t), then from Theorem 4.2, one of the vertices s and t must be an ancestor
of the other, which is a contradiction. The importance of this result is that it implies
that for any such pairs of vertices the corresponding column sparsity patterns in L
can be computed in parallel.


Theorem 4.7 (Liu 1990) Consider the elimination tree $\mathcal{T}$ and the Cholesky factor
L of A. Let $\mathcal{T}(i)$ and $\mathcal{T}(j)$ be two vertex-disjoint subtrees of $\mathcal{T}$. Then for all
$s \in \mathcal{T}(i)$ and $t \in \mathcal{T}(j)$, the entry $l_{st}$ of L is zero.

The explicit structure of L is not always required; sometimes only the numbers of
nonzeros in each row and column of L are needed. Examples include comparing
the amount of fill-in in the factors for different initial orderings of A, allocating
factor storage, finding relaxed supernodes (see Section 4.6), and determining load
balance and synchronization events in parallel factorizations.

Let $\mathrm{row}_L\{i\}$ denote the sparsity pattern of the off-diagonal part of row i of L,
that is,


$\mathrm{row}_L\{i\} = \mathcal{S}\{L_{i,1:i-1}\} = \{ j \mid j < i,\ l_{ij} \ne 0 \}, \quad 1 \le i \le n.$


The number of entries in L is


$\mathrm{nz}(L) = \sum_{i=1}^{n} |\mathrm{row}_L\{i\}| + n.$


Corollary 4.4 implies $\mathrm{row}_L\{i\}$ is given by the vertices of the row subtree $\mathcal{T}_r(i)$
(other than its root i). This suggests Algorithm 4.3. Here the vector mark is used to
flag vertices so as to avoid including them more than once within a row subtree. The
complexity of the algorithm is $O(\mathrm{nz}(L))$.


ALGORITHM 4.3 Computation of the row sparsity patterns of the Cholesky factor L
Input: A with a symmetric sparsity pattern, its undirected graph G and elimination
tree T described by the vector parent.
Output: Row sparsity patterns row_L{i} of the Cholesky factor L of A (1 ≤ i ≤ n).

 1: for i = 1 : n do                                  ▷ Loop over the rows of A
 2:   row_L{i} = ∅                                    ▷ Initialisation
 3:   mark(i) = i
 4:   for k ∈ adj_G{i} and k < i do                   ▷ Loop over the below diagonal entries in row i
 5:     j = k
 6:     while mark(j) ≠ i do                          ▷ Column j not yet encountered in row i
 7:       mark(j) = i                                 ▷ Flag j as encountered in row i
 8:       row_L{i} = row_L{i} ∪ {j}                   ▷ Add j to the sparsity pattern of row i
 9:       j = parent(j)                               ▷ Move up the elimination tree
10:     end while
11:   end for
12: end for

Figure 4.6 An illustration of the sparsity pattern of A and its graph $\mathcal{G}(A)$ (left) and the sparsity
pattern of the corresponding skeleton matrix $A^-$ and graph $\mathcal{G}(A^-)$ (right). The entries in A and
edges of $\mathcal{G}(A)$ that do not belong to the skeleton matrix and graph are depicted in red.
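The row-subtree traversal of Algorithm 4.3 translates almost line by line into code. The following Python sketch assumes 0-based labels, -1 for a root, and the hypothetical input format `lower_rows[i]` (columns j < i with a nonzero $a_{ij}$); the example pattern and its parent vector are assumptions used only to illustrate the traversal.

```python
def row_patterns(lower_rows, parent):
    """Row sparsity patterns of L (Algorithm 4.3, 0-based labels)."""
    n = len(lower_rows)
    mark = [-1] * n
    rows = []
    for i in range(n):
        pattern = []
        mark[i] = i
        for k in lower_rows[i]:
            j = k
            while mark[j] != i:  # column j not yet seen in row i
                mark[j] = i
                pattern.append(j)
                j = parent[j]  # move up the elimination tree
        rows.append(sorted(pattern))
    return rows


# Hypothetical pattern with entries (2,0), (3,1), (4,0), (4,3), (5,2), (5,4).
lower_rows = [[], [], [0], [1], [0, 3], [2, 4]]
parent = [2, 3, 4, 4, 5, -1]  # its elimination tree, e.g. from Algorithm 4.1
rows = row_patterns(lower_rows, parent)
print(rows)  # [[], [], [0], [1], [0, 2, 3], [2, 4]]
print(sum(len(r) for r in rows) + len(rows))  # nz(L) = 13
```

Row 4 picks up column 2 although $a_{42} = 0$: the walk from column 0 passes through its parent 2 before reaching 4, which is exactly the fill entry predicted by Corollary 4.4. The sketch relies on `parent` being the elimination tree of the given pattern, so that every walk reaches i before running past a root.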


Efficiency can be improved by employing the skeleton graph $\mathcal{G}(A^-)$ that is
obtained from $\mathcal{G}(A)$ by removing every edge (i, j) for which $j < i$ and j is not
a leaf vertex of $\mathcal{T}_r(i)$. $\mathcal{G}(A^-)$ is the smallest subgraph of $\mathcal{G}(A)$ with the same filled
graph as $\mathcal{G}(A)$. The corresponding matrix is the skeleton matrix. An example is
given in Figure 4.6. The complexity of constructing the elimination tree using the
skeleton matrix and its graph $\mathcal{G}(A^-)$ is $O(\mathrm{nz}(A^-)\, g(\mathrm{nz}(A^-), n))$, where $\mathrm{nz}(A^-)$ is
the number of entries in the skeleton matrix. Because $\mathrm{nz}(A^-)$ is often significantly
smaller than $\mathrm{nz}(A)$, an implementation that processes $\mathcal{G}(A^-)$ rather than $\mathcal{G}(A)$ can
be substantially faster.

Analogously to the row sparsity patterns, let $\mathrm{col}_L\{j\}$ denote the sparsity pattern
of the off-diagonal part of column j of L, that is,


$\mathrm{col}_L\{j\} = \mathcal{S}\{L_{j+1:n,j}\} = \{ i \mid i > j,\ l_{ij} \ne 0 \}, \quad 1 \le j \le n.$


The column replication principle can be written as


$\mathrm{col}_L\{j\} \subseteq \mathrm{col}_L\{\mathrm{parent}(j)\} \cup \{\mathrm{parent}(j)\}.$


Theorem 4.8 describes $\mathrm{col}_L\{j\}$ using the vertices of the subtree $\mathcal{T}(j)$.


Theorem 4.8 (George & Liu 1980c, 1981) The column sparsity pattern $\mathrm{col}_L\{j\}$
of the Cholesky factor L of the matrix A is equal to the adjacency set of vertices of
the subtree $\mathcal{T}(j)$ in $\mathcal{G}(A)$, that is,


$\mathrm{col}_L\{j\} = \mathrm{adj}_{\mathcal{G}(A)}\{\mathcal{T}(j)\}. \qquad (4.5)$


Proof If $i \in \mathrm{col}_L\{j\}$, then $j \in \mathrm{row}_L\{i\}$, and Theorem 4.3 implies $j \in \mathrm{anc}_{\mathcal{T}}\{k\}$ for
some k such that $a_{ik} \ne 0$. That is, $i \in \mathrm{adj}_{\mathcal{G}}\{\mathcal{T}(j)\}$. Conversely, $i \in \mathrm{adj}_{\mathcal{G}}\{\mathcal{T}(j)\}$
implies that in row i the entry in column j of L is nonzero. Thus, $j \in \mathrm{row}_L\{i\}$, and
hence, $i \in \mathrm{col}_L\{j\}$. $\square$

Figure 4.7 Two topological orderings of an elimination tree.


Algorithm 4.3 can be used to compute the column counts and the column sparsity
patterns because when j is added to $\mathrm{row}_L\{i\}$ at line 8, i can be added to $\mathrm{col}_L\{j\}$.
This does not generally obtain the column sparsity patterns sequentially. To derive
an approach that does compute them sequentially, rewrite (4.5) as follows:


$\mathrm{col}_L\{j\} = \Big( \mathrm{adj}_{\mathcal{G}(A)}\{j\} \;\cup \bigcup_{k \in \mathcal{T}(j) \setminus \{j\}} \mathrm{col}_L\{k\} \Big) \setminus \{1, \ldots, j\}.$


Using column replication, this can be significantly simplified:


$\mathrm{col}_L\{j\} = \Big( \mathrm{adj}_{\mathcal{G}(A)}\{j\} \;\cup \bigcup_{\{k \,\mid\, j = \mathrm{parent}(k)\}} \mathrm{col}_L\{k\} \Big) \setminus \{1, \ldots, j\}. \qquad (4.6)$


This is used to obtain Algorithm 4.4, which constructs the sparsity pattern of each
column j of L as the union of the sparsity pattern of column j of A ($\mathrm{adj}_{\mathcal{G}(A)}\{j\}$) and
the patterns of the children of j in $\mathcal{T}(A)$. Here child{j} denotes the set of children
of j. Because any child k of j satisfies $k < j$, the j-th outer step has the information
needed to compute the sparsity pattern described by (4.6). Observe that $\mathcal{T}(A)$ does
not need to be input.

ALGORITHM 4.4 Determining the sparsity patterns of each column of L
Input: A with symmetric sparsity pattern and its undirected graph G.
Output: Column sparsity patterns col_L{j} of the Cholesky factor L of A (1 ≤ j ≤ n).

 1: child{1 : n} = ∅                                  ▷ Initialisation
 2: for j = 1 : n do                                  ▷ Loop over the columns of L
 3:   col_L{j} = adj_G{j} \ {1, ..., j - 1}
 4:   for k ∈ child{j} do                             ▷ Unifying child structures in (4.6)
 5:     col_L{j} = (col_L{j} ∪ col_L{k}) \ {j}
 6:   end for
 7:   if col_L{j} ≠ ∅ then
 8:     l = min{i | i ∈ col_L{j}}
 9:     child{l} = child{l} ∪ {j}                     ▷ Parent of j detected using Corollary 4.6
10:   end if
11: end for
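A short Python sketch of Algorithm 4.4 may help to see how the child lists are filled in as a by-product. It assumes 0-based labels and a hypothetical input `adj`, where `adj[j]` is the full neighbour set of j in the graph of A; all child lists are created empty before the loop starts so that children recorded at earlier steps are still available at step j. Returning the parent vector as well is an addition made here for illustration.

```python
def column_patterns(adj):
    """Column sparsity patterns of L (Algorithm 4.4, 0-based labels).

    adj[j] is the set of neighbours of j in the graph of A (i < j and i > j).
    Returns (cols, parent), with parent[j] == -1 for a root.
    """
    n = len(adj)
    child = [[] for _ in range(n)]  # all lists empty before the loop starts
    cols = [set() for _ in range(n)]
    parent = [-1] * n
    for j in range(n):
        cols[j] = {i for i in adj[j] if i > j}  # column j of A below the diagonal
        for k in child[j]:
            cols[j] |= cols[k] - {j}  # merge the child patterns, see (4.6)
        if cols[j]:
            p = min(cols[j])  # first subdiagonal entry gives the parent
            child[p].append(j)
            parent[j] = p
    return cols, parent


# Hypothetical symmetric pattern given by its off-diagonal entries.
edges = [(2, 0), (3, 1), (4, 0), (4, 3), (5, 2), (5, 4)]
adj = [set() for _ in range(6)]
for r, c in edges:
    adj[r].add(c)
    adj[c].add(r)

cols, parent = column_patterns(adj)
print([sorted(c) for c in cols])  # [[2, 4], [3], [4, 5], [4], [5], []]
print(parent)                     # [2, 3, 4, 4, 5, -1]
```

Column 2 receives row 4 from its child 0, which is the same fill entry that appears in row 4 when the row patterns are computed instead.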


4.4 Topological Orderings 


The outer loop in Algorithm 4.4 does not have to be performed in the strict order
$j = 1, \ldots, n$. What is necessary is that for each step j, the column sparsity pattern for
each child of j has already been computed. An ordering of the vertices in a tree (and,
more generally, in a DAG) is a topological ordering if, for all i and j, $j \in \mathrm{desc}_{\mathcal{T}}\{i\}$
implies $j < i$ (Section 2.2). Observation 4.2 confirms that the ordering of vertices
in the elimination tree $\mathcal{T}$ is a topological ordering. A new topological ordering of $\mathcal{T}$
defines a relabelling of its vertices corresponding to a symmetric permutation of A.
This is illustrated in Figure 4.7. The sparsity patterns of the Cholesky factors of A
and $PAP^T$ can be different, but the following result shows that the amount of fill-in
is the same.


Theorem 4.9 (Liu 1990) Let $\mathcal{S}\{A\}$ be symmetric. If P is the permutation matrix
corresponding to a topological ordering of the elimination tree $\mathcal{T}$ of A, then the
filled graphs of A and $PAP^T$ are isomorphic.


There are many topological orderings of $\mathcal{T}$. One class is obtained using the depth-
first search given by Algorithm 2.1. This searches all the components of $\mathcal{T}$ starting
at their root vertices. In this case, once vertex i has been visited, all the vertices of
the subtree $\mathcal{T}(i)$ are visited immediately after i and i is labelled as the last vertex of
$\mathcal{T}(i)$. A topological ordering of $\mathcal{T}$ is a postordering if the vertex set of any subtree
$\mathcal{T}(i)$ ($i = 1, \ldots, n$) is a contiguous sublist of $1, \ldots, n$. Unless additional rules on
how vertices are selected are imposed, a postordering is generally not unique, as
demonstrated in Figure 4.8. One possible postordering is defined in Algorithm 2.1.
In this case, there is some freedom in the depth-first search to choose from the
vertices that have not been visited, resulting in different postorderings.
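To make the definition of a postordering concrete, the following Python sketch computes one postordering of a tree given by its parent vector and checks the contiguity property. It is an illustrative depth-first search written for this purpose, not a transcription of Algorithm 2.1; the function names, the 0-based labels, the use of -1 for roots, and the rule of visiting children in increasing label order are all assumptions.

```python
def postorder(parent):
    """Return order, where order[k] is the old label of the vertex relabelled k."""
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for v in range(n):
        (roots if parent[v] == -1 else children[parent[v]]).append(v)
    order = []
    for r in roots:
        stack = [(r, 0)]  # (vertex, index of the next child to visit)
        while stack:
            v, idx = stack.pop()
            if idx < len(children[v]):
                stack.append((v, idx + 1))
                stack.append((children[v][idx], 0))
            else:
                order.append(v)  # all of T(v) has been labelled before v
    return order


def relabel(parent, order):
    """Parent vector after relabelling old vertex order[k] as k."""
    new_label = [0] * len(parent)
    for k, v in enumerate(order):
        new_label[v] = k
    new_parent = [-1] * len(parent)
    for v, p in enumerate(parent):
        if p != -1:
            new_parent[new_label[v]] = new_label[p]
    return new_parent


def is_postordered(parent):
    """True if every subtree T(v) is the contiguous range ending at v."""
    n = len(parent)
    size = [1] * n
    first = list(range(n))
    for v in range(n):
        p = parent[v]
        if p == -1:
            continue
        if p <= v:
            return False  # not even a topological ordering
        size[p] += size[v]
        first[p] = min(first[p], first[v])
    return all(first[v] == v - size[v] + 1 for v in range(n))


parent = [2, 3, 4, 4, 5, -1]  # topological, but T(2) = {0, 2} is not contiguous
order = postorder(parent)
print(order)                   # [0, 2, 1, 3, 4, 5]
print(relabel(parent, order))  # [1, 4, 3, 4, 5, -1]
```

Visiting the children of a vertex in a different order would give a different, equally valid postordering, which is the non-uniqueness noted above.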




Figure 4.8 An example to illustrate the non-uniqueness of postorderings of an elimination tree. 


4.5 Leaf Vertices of Row Subtrees 


Leaf vertices of row subtrees play a key role in graph algorithms related to sparse
Cholesky factorizations. They can be used to find the skeleton matrix described in
Section 4.3, and they are important in parallel processing based on fundamental
supernodes (see Section 4.6.1). Theorem 4.10 describes the relation between
standard subtrees of $\mathcal{T}$ and row subtrees obtained by pruning (Section 4.2). This
pruning is determined by the leaf vertices of row subtrees.


Theorem 4.10 (Liu 1986) Let the elimination tree $\mathcal{T}$ of A be postordered. Let the
column indices of the nonzeros in the strictly lower triangular part of row i of A be
$c_1, \ldots, c_s$ with $s \ge 1$ and $0 < c_1 < \ldots < c_s < i$. Then $c_t$ is a leaf vertex of the row
subtree $\mathcal{T}_r(i)$ if and only if


$t = 1$ or ($1 < t \le s$ and $c_{t-1} \notin \mathcal{T}(c_t)$).


Proof $c_1$ is always a leaf vertex of $\mathcal{T}_r(i)$. If this is not the case, then there exists a
directed path from some vertex k, $k \ne c_1$, to i via $c_1$ such that $k \in \mathcal{T}_r(i)$ and $a_{ik} \ne 0$.
Postordering of $\mathcal{T}$ implies $k < c_1$. This is a contradiction because $c_1$ is the index of
the first nonzero in row i.

Consider now $t > 1$. Assume that $c_{t-1} \in \mathcal{T}(c_t)$ and that $c_t$ is a leaf vertex of
$\mathcal{T}_r(i)$. Row replication (Theorem 4.2) implies any $k \in \mathrm{anc}_{\mathcal{T}}\{c_{t-1}\} \cup \{c_{t-1}\}$ such that
$c_{t-1} \le k < i$ satisfies $l_{ik} \ne 0$. Because $\mathcal{T}$ is postordered, $c_{t-1} \le k < c_t$, and there
is at least one $k < c_t$ satisfying this inequality. It follows that $k = c_{t-1}$. Because k
belongs to $\mathcal{T}_r(i)$, $c_t$ cannot be a leaf vertex of $\mathcal{T}_r(i)$, which is a contradiction.

Conversely, assume for $t > 1$ that $c_{t-1} \notin \mathcal{T}(c_t)$ and $c_t$ is not a leaf vertex of
$\mathcal{T}_r(i)$. From the second part of the assumption and the fact that $c_t \in \mathcal{T}_r(i)$, it follows
that there is at least one leaf vertex $k < i$ of $\mathcal{T}_r(i)$ from which there is a directed
path to i via $c_t$. Thus $k < c_t$. From the definition of the postordering of $\mathcal{T}$, all
vertices l with $k < l < c_t$ are vertices of $\mathcal{T}(c_t)$. Vertex $c_{t-1}$ must be among them
and $c_{t-1} \in \mathcal{T}(c_t)$. This contradiction completes the proof. $\square$

ALGORITHM 4.5 Find the sizes of subtrees T(i) of T
Input: Elimination tree T described by the vector parent.
Output: Subtree sizes |T(i)| (1 ≤ i ≤ n).

1: |T(1 : n)| = 1
2: for i = 1 : n - 1 do
3:   k = parent(i)
4:   |T(k)| = |T(k)| + |T(i)|
5: end for

Corollary 4.11 (Liu 1986) Under the assumptions of Theorem 4.10, $c_t$ is a leaf
vertex of $\mathcal{T}_r(i)$ if and only if


$t = 1$ or ($1 < t \le s$ and $c_{t-1} < c_t - |\mathcal{T}(c_t)| + 1$).


Subtree sizes can be computed using Algorithm 4.5. Correctness of Algo- 
rithm 4.5 is guaranteed because parent defines a topological ordering of T. 
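The subtree sizes and the test of Corollary 4.11 fit in a few lines of Python. The sketch below assumes 0-based labels (the inequality of the corollary is unchanged by the shift), -1 for a root, and a postordered tree; the function names are hypothetical. The example tree is the elimination tree of an assumed 6 x 6 pattern whose strictly lower triangular rows are 1: [0], 3: [2], 4: [0, 3], 5: [1, 4].

```python
def subtree_sizes(parent):
    """Sizes |T(i)| (Algorithm 4.5, 0-based); roots have parent -1."""
    n = len(parent)
    size = [1] * n
    for i in range(n):
        k = parent[i]
        if k != -1:  # skipping roots also covers forests
            size[k] += size[i]
    return size


def row_subtree_leaves(cols, size):
    """Leaves of one row subtree by Corollary 4.11.

    cols: increasing column indices c_1 < ... < c_s of one row of the strictly
    lower triangular part of A; the elimination tree must be postordered.
    """
    leaves = []
    for t, c in enumerate(cols):
        if t == 0 or cols[t - 1] < c - size[c] + 1:
            leaves.append(c)
    return leaves


parent = [1, 4, 3, 4, 5, -1]  # a postordered elimination tree
size = subtree_sizes(parent)
print(size)                              # [1, 2, 1, 2, 5, 6]
print(row_subtree_leaves([0, 3], size))  # row 4: [0, 3]
print(row_subtree_leaves([1, 4], size))  # row 5: [1]
```

In row 5, column 4 is not a leaf because the previous entry, column 1, lies inside the subtree rooted at 4, which occupies the labels 0 to 4.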

Theorem 4.12 relaxes the condition that the entries in the rows of A are sorted 
by increasing column indices. This allows the leaf vertices of the row subtrees to be 
determined by columns. 


Theorem 4.12 (Liu et al. 1993) Consider the elimination tree $\mathcal{T}$ of A. Vertex j is
a leaf vertex of some row subtree of $\mathcal{T}$ if and only if there exists $i \in \mathrm{adj}_{\mathcal{G}(A)}\{j\}$,
$j < i \le n$, such that $i \notin \mathrm{adj}_{\mathcal{G}(A)}\{k\}$ for all $k \in \mathcal{T}(j) \setminus \{j\}$.


Proof Assume that for some $i \in \mathrm{anc}_{\mathcal{T}}\{j\}$ vertex j is a leaf vertex of $\mathcal{T}_r(i)$. That is,
$i \in \mathrm{adj}_{\mathcal{G}(A)}\{j\}$, $i > j$. Suppose there exists k in $\mathcal{T}(j) \setminus \{j\}$ such that $i \in \mathrm{adj}_{\mathcal{G}(A)}\{k\}$.
Then all the ancestors of k that are smaller than i, in particular j, belong to $\mathcal{T}_r(i)$ and
j cannot be a leaf vertex of $\mathcal{T}_r(i)$. This is a contradiction.

Conversely, assume that j is not a leaf vertex of any row subtree of $\mathcal{T}$ and that
there exists $i \in \mathrm{adj}_{\mathcal{G}(A)}\{j\}$, $j < i \le n$, such that $i \notin \mathrm{adj}_{\mathcal{G}(A)}\{k\}$ for all
$k \in \mathcal{T}(j) \setminus \{j\}$. Because j is not a leaf vertex of any such $\mathcal{T}_r(i)$, Theorem 4.3 implies
that there exists $k \in \mathcal{T}(j) \setminus \{j\}$ such that $a_{ik} \ne 0$, which gives a contradiction and
completes the proof. $\square$


To find leaf vertices of row subtrees of $\mathcal{T}$, Algorithm 4.6 uses a marking scheme
based on Theorem 4.12 and exploits the postordering of $\mathcal{T}$. The auxiliary vector
prev_nonz stores the column indices of the most recently encountered entries in
the rows of the strictly lower triangular part of A.

ALGORITHM 4.6 Find leaf vertices of row subtrees of T
Input: A with a symmetric sparsity pattern and a corresponding postordered
elimination tree T.
Output: Logical vector isleaf with entries set to true for leaf vertices of row
subtrees.

 1: isleaf(1 : n) = false, prev_nonz(1 : n) = 0
 2: Compute |T(1 : n)|                                ▷ Use Algorithm 4.5
 3: for j = 1 : n do                                  ▷ Loop over the columns of A
 4:   for i such that i > j and a_ij ≠ 0 do           ▷ Row index in strictly lower triangular part of A
 5:     k = prev_nonz(i)                              ▷ Column index of most recently seen entry in row i
 6:     if k < j - |T(j)| + 1 then
 7:       isleaf(j) = true                            ▷ j is a leaf vertex by Corollary 4.11
 8:     end if
 9:     prev_nonz(i) = j                              ▷ Flag j as the most recently seen entry in row i
10:   end for
11: end for


4.6 Supernodes and the Assembly Tree


Because of column replication, the columns of L generally become denser as the
Cholesky factorization proceeds. Exploiting this density can significantly enhance
the performance of the numerical factorization in terms of both computation time
and memory requirements. For this, we require the concept of supernodes. The idea
is to group together columns with the same sparsity structure, so that they can be
treated as a dense matrix for storage and computation. Let $1 \le s, t \le n$ with
$s + t - 1 \le n$. A set of contiguously numbered columns of L with indices
$S = \{s, s+1, \ldots, s+t-1\}$ is a supernode of L if


$\mathrm{col}_L\{s\} \cup \{s\} = \mathrm{col}_L\{s+t-1\} \cup \{s, \ldots, s+t-1\}, \qquad (4.7)$


and S cannot be extended for $s > 1$ by adding $s - 1$ or for $s + t - 1 < n$ by adding
$s + t$. Because S cannot be extended, it is a maximal subset of column indices.
In graph terminology, a supernode is a maximal clique of contiguous vertices of
$\mathcal{G}(L + L^T)$. A supernode may contain a single vertex. Figure 4.9 illustrates the
supernodes in a Cholesky factor of order 8.

The supernodal elimination or assembly tree is defined to be the reduction of 
the elimination tree that contains only supernodes. Each vertex of the elimination 
tree is associated with one elimination, and a single integer (the index of its parent) 
is needed. Associated with each vertex of the assembly tree is an index list of the 
row indices of the nonzeros in the columns of the supernode. These implicitly define 
the sparsity pattern of L. An example that demonstrates the difference between the 
elimination and assembly trees is given in Figure 4.10. Here the elimination tree is 
postordered, and there are 5 supernodes: {1, 2}, 3, 4, 5, {6, 7, 8, 9}. For supernode 1,
which comprises columns 1 and 2, the row index list is {1, 2, 8, 9}.

Figure 4.9 An example to illustrate supernodes in L. The first supernode comprises columns 1
and 2, the second columns 3 and 4, and the third columns 5-8.

Figure 4.10 A sparse matrix and its postordered elimination tree (left) and postordered assembly
tree (right). Filled entries in $\mathcal{S}\{L + L^T\}$ are denoted by f. For the assembly tree, the vertices are
in red and the index lists associated with each vertex are given.


Supernodes can be characterized by the following result on the column counts 
of L, from which we see that supernodes can be found using column counts rather 
than the column sparsity patterns that appear in (4.7). 


Theorem 4.13 (Liu et al. 1993) The set of columns of L with indices
$S = \{s, s+1, \ldots, s+t-1\}$ is a supernode of L if and only if it is a maximal set of
contiguous columns such that $s + i - 1$ is a child of $s + i$ for $i = 1, \ldots, t-1$ and


$|\mathrm{col}_L\{s\}| = |\mathrm{col}_L\{s+t-1\}| + t - 1. \qquad (4.8)$


Proof Let S be a supernode. For $i, j \in S$ with $i > j$, we have $i \in \mathrm{col}_L\{j\}$. This
implies that in the postordered elimination tree the vertex $i = j + 1$ is the parent of
j for $j = s, \ldots, s+t-2$. Moreover, from Observation 4.2, for any $i, j \in S$ with
$i > j$, $i \in \mathrm{col}_L\{j\}$ implies $\mathrm{col}_L\{j\} \setminus \{1, \ldots, i\} \subseteq \mathrm{col}_L\{i\}$. Therefore,


$|\mathrm{col}_L\{s+i\}| \ge |\mathrm{col}_L\{s+i-1\}| - 1, \quad i = 1, \ldots, t-1, \qquad (4.9)$


with equality if and only if


$\mathrm{col}_L\{s+i\} = \mathrm{col}_L\{s+i-1\} \setminus \{s+i\},$


that is, if S is a supernode.

Conversely, assume S is a maximal set of contiguous columns such that, for
$i = 1, \ldots, t-1$, $s + i - 1$ is a child of $s + i$ and S satisfies (4.8). Because of
column replication, such a sequence of parent and child vertices must satisfy (4.9)
with equality if and only if (4.7) is satisfied. It follows that S is a supernode. $\square$
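Theorem 4.13 suggests a simple left-to-right scan that needs only the parent vector and the column counts. The Python sketch below is one possible reading of that idea, with 0-based labels and a hypothetical function name; it extends the current supernode while the next column is the parent of the last one and (4.8) still holds. The example uses an assumed 6 x 6 pattern with subdiagonal entries (1,0), (3,2), (4,0), (4,2), (4,3), (5,0), (5,2), (5,4), whose tree and column counts were worked out by hand; its supernodes have the same shape as those quoted in the caption of Figure 4.11, but it is not claimed to be that matrix.

```python
def supernodes_from_counts(parent, count):
    """Partition columns into supernodes using Theorem 4.13 (0-based).

    count[j] = |col_L{j}|, the number of off-diagonal nonzeros in column j.
    """
    n = len(parent)
    supernodes = []
    s = 0
    while s < n:
        last = s
        # Extend while last is a child of last + 1 and (4.8) holds for s..last+1.
        while (last + 1 < n and parent[last] == last + 1
               and count[s] == count[last + 1] + (last + 1 - s)):
            last += 1
        supernodes.append(list(range(s, last + 1)))
        s = last + 1
    return supernodes


parent = [1, 4, 3, 4, 5, -1]
count = [3, 2, 3, 2, 1, 0]
print(supernodes_from_counts(parent, count))  # [[0, 1], [2, 3, 4, 5]]
```

Because (4.9) holds for every parent and child pair along the chain, equality in (4.8) for the whole run forces equality at every step, so stopping at the first failure does not miss a longer supernode.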


Supernodes enhance the efficiency of sparse factorizations and sparse triangular 
solves because they enable floating-point operations to be performed on dense 
submatrices rather than on individual nonzeros, thus improving memory hierarchy 
utilization and allowing the use of highly efficient dense linear algebra kernels (such 
as Level 3 BLAS kernels). Because the rows and columns of a supernode have 
a common sparsity structure, this only needs to be stored once, reducing indirect 
addressing. Supernodes help to increase the granularity of tasks, which is useful for 
improving the computation to overhead ratio in a parallel implementation. Fill-in 
results in supernodes near the root of the assembly tree often being much larger 
than those close to the leaf vertices. 

Observe that the columns within a supernode are numbered consecutively, 
but they can be numbered within the supernode in any order without changing 
the number of nonzeros in L (assuming the corresponding rows are permuted 
symmetrically). On some architectures, particularly those using GPUs, this freedom 
can be exploited to improve the factorization efficiency. Essentially, it is desirable 
to order the columns within a supernode such that the entries of L form fewer but 
less fragmented dense blocks. 

Some applications, such as power grid analysis, in which the basis of the linear 
system is not a finite element or finite difference discretization of a physical domain, 
can lead to sparse matrices that incur very little fill-in during factorization. The 
supernodes can then be very small, and the costs associated with identifying them 
may not be offset by the increase in performance resulting from the potential for 
block operations. However, as supernodes can offer such significant performance 
gains, it can be advantageous to merge (small) supernodes that have similar (but 
not exactly the same) nonzero patterns, despite this increasing the overall fill-in and 
operation count. This process is termed supernode amalgamation, and the resultant
nodes are often referred to as relaxed supernodes.

In practice, fundamental supernodes are easier to work with in the numerical
factorization. Let $1 \le s, t \le n$ with $s + t - 1 \le n$. A maximal set of contiguously
numbered columns of L with indices $S = \{s, s+1, \ldots, s+t-1\}$ is a fundamental
supernode if for any successive pair $i - 1$ and i in the list, $i - 1$ is the only child of i
in $\mathcal{T}$ and $\mathrm{col}_L\{i\} \cup \{i\} = \mathrm{col}_L\{i-1\}$. The vertex s is termed the starting vertex. An
example is given in Figure 4.11. The difference between the sets of supernodes and
fundamental supernodes is normally not large, with the latter having (slightly) more
blocks in the resulting partitioning of L. Note that fundamental supernodes are
independent of the choice of the postordering of $\mathcal{T}$. Theorem 4.14 describes the
relationship between fundamental supernodes and the leaf vertices of row subtrees of
$\mathcal{T}$. In particular, it characterizes starting vertices of the fundamental supernodes. The
leaf vertices of $\mathcal{T}$ are trivially starting vertices of fundamental supernodes. But,
possibly surprisingly, so too are the leaf vertices of row subtrees.


Theorem 4.14 (Liu et al. 1993) Assume $\mathcal{T}$ is postordered. Vertex s is the starting
vertex of a fundamental supernode if and only if it has at least two child vertices in
$\mathcal{T}$ or it is a leaf vertex of a row subtree of $\mathcal{T}$.


Proof If s has at least two child vertices then, from the definition of a fundamental
supernode, it must be the starting vertex of a fundamental supernode. Assume that,
for some $i > s$, s is a leaf vertex of $\mathcal{T}_r(i)$. If s is also a leaf vertex of $\mathcal{T}$, then s
is a starting vertex of a supernode. The remaining case is s having only one child.
Because $\mathcal{T}$ is postordered, this child must be $s - 1$. Theorem 4.3 then implies
$a_{is} \ne 0$ and $a_{i,s-1} = 0$, that is, $i \in \mathrm{col}_L\{s\}$ and $i \notin \mathrm{col}_L\{s-1\}$. It follows that


$\mathcal{S}\{L_{s-1:n,s-1}\} \subsetneq \mathcal{S}\{L_{s:n,s}\} \cup \{s-1\},$


and vertices s and $s - 1$ cannot belong to the same supernode. Hence, s is the
starting vertex of a new fundamental supernode.

Conversely, assume that s is the starting vertex of a fundamental supernode S. If s
has no child vertices or at least two child vertices, the result follows. If s has exactly
one child vertex, postordering implies this child is $s - 1$. Because S is maximal,
there exists i such that $i \notin \mathrm{col}_L\{s-1\}$ and $i \in \mathrm{col}_L\{s\}$ (otherwise S could be
extended by adding $s - 1$). Hence, s is a leaf vertex of $\mathcal{T}_r(i)$. $\square$

Figure 4.11 A matrix A and its postordered elimination tree $\mathcal{T}$ for which the set of supernodes
{1, 2} and {3, 4, 5, 6} and the set of fundamental supernodes {1, 2}, {3, 4} and {5, 6} are different.
The filled entries in $\mathcal{S}\{L + L^T\}$ are denoted by f.


Because fundamental supernodes are characterized by their starting vertices, they
can be found by modifying Algorithm 4.6 to incorporate marking leaf vertices of the
row subtrees and vertices with at least two child vertices. Once the elimination tree
has been computed, the complexity is $O(n + \mathrm{nz}(A))$. The computation can be made
even more efficient by using the skeleton graph $\mathcal{G}(A^-)$.
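One way to realize this modification is sketched below in Python. It combines the subtree sizes of Algorithm 4.5, the leaf test of Algorithm 4.6 and a child count, and then cuts the column sequence at every starting vertex identified by Theorem 4.14. The 0-based labels, the input format `below[j]` (rows i > j with a nonzero $a_{ij}$), the function name, and the decision to treat vertices without children as starting vertices are assumptions of this sketch. The example reuses an assumed pattern whose supernodes have the shape quoted in the caption of Figure 4.11.

```python
def fundamental_supernodes(below, parent):
    """Fundamental supernodes of a postordered tree (0-based labels).

    below[j] lists the rows i > j with a nonzero a_ij; parent[j] == -1 is a root.
    """
    n = len(parent)
    size = [1] * n
    nchild = [0] * n
    for v in range(n):  # Algorithm 4.5 plus child counts
        p = parent[v]
        if p != -1:
            size[p] += size[v]
            nchild[p] += 1
    isleaf = [False] * n
    prev_nonz = [-1] * n
    for j in range(n):  # Algorithm 4.6
        for i in below[j]:
            if prev_nonz[i] < j - size[j] + 1:
                isleaf[j] = True
            prev_nonz[i] = j
    # Starting vertices: row-subtree leaves, or vertices without exactly one child.
    starts = [j for j in range(n) if isleaf[j] or nchild[j] != 1]
    bounds = starts + [n]
    return [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]


# Hypothetical pattern with entries (1,0), (3,2), (4,0), (4,2), (4,3),
# (5,0), (5,2), (5,4); its postordered elimination tree is given by parent.
below = [[1, 4, 5], [], [3, 4, 5], [4], [5], []]
parent = [1, 4, 3, 4, 5, -1]
print(fundamental_supernodes(below, parent))  # [[0, 1], [2, 3], [4, 5]]
```

Vertex 4 starts a new fundamental supernode only because it has two children, which is why the supernode made up of columns 2 to 5 splits into two fundamental supernodes here.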


4.7 Notes and References 


The excellent monographs by Tewarson (1973), George & Liu (1981), and Davis 
(2006) represent milestones in the development of contemporary symbolic factor- 
ization algorithms and their implementation. A complementary way to follow many 
of the developments is by looking at the early software (and accompanying user 
documentation), such as YSMP (Eisenstat et al., 1982) and SPARSPAK (George 
& Ng, 1984). In addition, there are several influential survey articles focusing on 
sparse Cholesky algorithms and emphasizing the crucial role of the elimination tree, 
for example, Liu (1990), George (1998); see also Bollhöfer & Schenk (2006), Hogg
& Scott (2013a) and the more recent comprehensive survey of Davis et al. (2016). 
The latter provides a general overview of much of the research related to sparse 
direct methods and includes pointers to many specialized references. 

There are a large number of journal articles that provide a fuller understanding 
of the theory and algorithms employed in symbolic factorizations. Schreiber (1982) 
defines the elimination tree of a sparse symmetric matrix. The seminal paper of Liu 
(1986) describes elimination tree construction, while for an extensive overview of 
the roles of elimination trees and topological orderings as well as the determination 
of the column sparsity patterns of the factor L, we refer to Liu (1990). If only row 
and column counts of L are needed, the fastest known algorithms are described in 
Gilbert et al. (1994). This paper also refers to another admirable paper of Liu et al. 
(1993) that describes the efficient computation of fundamental supernodes based on 
the leaf vertices of row subtrees of the elimination tree. 

A key driver behind research into efficient (in terms of time and memory) 
sparse Cholesky algorithms has always been the development of computational 
codes. Many currently available packages implement not only sparse Cholesky 
factorizations but also more general $LDL^T$ factorizations of sparse symmetric
indefinite matrices. The software is necessarily highly sophisticated and is therefore 
generally accompanied by technical reports and/or journal publications that explain 
the data structures and choices that were made in the algorithm and software design 
as well as providing details of the different options that are offered (examples 
include Duff (2004), Reid & Scott (2009), Hogg et al. (2010)). 
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 5
Sparse Cholesky Solver: The
Factorization Phase


The adoption of Cholesky’s method owes not a little to the 
publicity given to it shortly after the end of World War II by 
British mathematicians and computer pioneers, including Alan 
Turing, Leslie Fox, Jim Wilkinson, and especially John Todd — 
Benzi (2017). 


Achieving high performance for sparse direct solvers in general, 
and sparse Cholesky factorization, in particular, is a very well 
researched topic — Rennich et al. (2016) 


Having considered the symbolic phase of a sparse Cholesky solver in the previous 
chapter, the focus of this chapter is the subsequent numerical factorization phase. 
If $A$ is a symmetric positive definite (SPD) matrix, then it is factorizable (strongly
regular) and (in exact arithmetic) its Cholesky factorization $A = LL^T$ exists. $LDL^T$
factorizations of general symmetric indefinite matrices are considered in Chapter 7.


5.1 Dense Cholesky Factorizations 


Because efficient implementations of sparse Cholesky factorizations rely heavily on 
exploiting dense blocks, we first consider algorithms for the Cholesky factorization 
of dense matrices that can be applied to such blocks. Algorithm 5.1 is a basic left- 
looking algorithm. It is an in-place algorithm because L can overwrite the lower 
triangular part of A (thus reducing memory requirements if A is no longer required). 

Writing A in the block form (1.2), the computation can be reorganized to give 
Algorithm 5.2. This allows the exploitation of Level 3 BLAS for the computa- 
tionally intensive components (dense matrix-matrix multiplies and dense triangular 
solves). Here A has nb block columns, which are referred to as panels. Step 6 can 
be performed using Algorithm 5.1. 

Algorithms 5.1 and 5.2 are left-looking. This means that the updates are not 
applied immediately. Instead, all updates from previous (block) columns are applied 
together to the current (block) column before it is factorized. In a right-looking 
ALGORITHM 5.1 In-place dense left-looking Cholesky factorization
Input: Dense SPD matrix $A$.
Output: Factor $L$ such that $A = LL^T$.

1: for $j = 1 : n$ do
2:    $L_{j:n,j} = A_{j:n,j}$    ▷ Only the lower triangular part of $A$ is required
3:    for $k = 1 : j-1$ do
4:       $L_{j:n,j} = L_{j:n,j} - L_{j:n,k}\, l_{jk}$    ▷ Update column $j$ using previous columns
5:    end for
6:    $l_{jj} = (l_{jj})^{1/2}$    ▷ Overwrite the diagonal entry with its square root
7:    $L_{j+1:n,j} = L_{j+1:n,j} / l_{jj}$    ▷ Scale off-diagonal entries in column $j$
8: end for
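
As an illustration, Algorithm 5.1 can be sketched in a few lines of Python. This is a minimal sketch rather than a tuned implementation: it assumes the matrix is held as a list of rows, uses 0-based indices, and returns a copy instead of overwriting $A$; the function name is ours.

```python
import math


def dense_left_looking_cholesky(a):
    """Return the lower triangular factor L with A = L L^T.

    `a` is a dense SPD matrix given as a list of rows; only its lower
    triangle is read. Indices are 0-based, unlike the 1-based pseudocode.
    """
    n = len(a)
    # Copy the lower triangle of A; the strict upper triangle stays zero.
    fac = [[a[i][j] if i >= j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        # Apply the updates from all previous columns k < j to column j.
        for k in range(j):
            ljk = fac[j][k]
            for i in range(j, n):
                fac[i][j] -= fac[i][k] * ljk
        if fac[j][j] <= 0.0:
            raise ValueError("matrix is not positive definite")
        # Overwrite the diagonal entry with its square root, then scale.
        fac[j][j] = math.sqrt(fac[j][j])
        for i in range(j + 1, n):
            fac[i][j] /= fac[j][j]
    return fac


L = dense_left_looking_cholesky([[4.0, 2.0, 2.0],
                                 [2.0, 5.0, 0.0],
                                 [2.0, 0.0, 5.25]])
```

In this small example the entry in position (3, 2) of $A$ is zero but the corresponding entry of $L$ is not, so the example also exhibits fill-in.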


ALGORITHM 5.2 In-place dense left-looking panel Cholesky factorization
Input: Dense SPD matrix $A$ in the form (1.2) with $nb$ panels.
Output: Factor $L$ such that $A = LL^T$.

1: for $jb = 1 : nb$ do
2:    $L_{jb:nb,jb} = A_{jb:nb,jb}$
3:    for $kb = 1 : jb-1$ do
4:       $L_{jb:nb,jb} = L_{jb:nb,jb} - L_{jb:nb,kb} L_{jb,kb}^T$    ▷ Update block column $jb$
5:    end for
6:    Compute in-place factorization of $L_{jb,jb}$    ▷ Overwrite $L_{jb,jb}$ with its Cholesky factor
7:    $L_{jb+1:nb,jb} = L_{jb+1:nb,jb} L_{jb,jb}^{-T}$    ▷ Dense triangular solve
8: end for


approach (Algorithm 5.3), outer product updates are applied to the part of the matrix 
that has not yet been factored as they are generated. 
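
To make the contrast with the left-looking variant concrete, the following sketch shows the right-looking approach for the scalar case (each panel is a single column). It is an illustrative sketch under the same assumptions as a plain list-of-rows storage with 0-based indices; the function name is ours.

```python
import math


def dense_right_looking_cholesky(a):
    """Right-looking Cholesky with single-column panels; returns L (A = L L^T)."""
    n = len(a)
    fac = [[a[i][j] if i >= j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        if fac[j][j] <= 0.0:
            raise ValueError("matrix is not positive definite")
        # Column j has already received all its updates: factorize it now.
        fac[j][j] = math.sqrt(fac[j][j])
        for i in range(j + 1, n):
            fac[i][j] /= fac[j][j]
        # Immediately apply the outer product update of column j to the
        # lower triangle of the part that has not yet been factored.
        for k in range(j + 1, n):
            for i in range(k, n):
                fac[i][k] -= fac[i][j] * fac[k][j]
    return fac


L = dense_right_looking_cholesky([[4.0, 2.0, 2.0],
                                  [2.0, 5.0, 0.0],
                                  [2.0, 0.0, 5.25]])
```

The arithmetic is the same as in the left-looking variant; only the moment at which each update is applied differs.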

The large panel updates can be split into operations involving only blocks. This 
is shown in Algorithm 5.4 for the right-looking approach. 

The panel and block descriptions of the factorization enable efficient 
parallelization. The three main block operations, which are called tasks, are 
factorize( jb), solve(ib, jb), and update(ib, jb, kb). There are the following 
dependencies between the tasks. 


factorize(jb) depends on update(jb, kb, jb) for all $kb = 1, \ldots, jb-1$.

solve(ib, jb) depends on update(ib, kb, jb) for all $kb = 1, \ldots, jb-1$, and
factorize(jb).

update(ib, jb, kb) depends on solve(ib, jb) and solve(kb, jb).
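
The dependencies can be turned into an explicit task graph. The sketch below uses the task labelling of Algorithm 5.4, in which update(ib, jb, kb) reads blocks (ib, jb) and (kb, jb) and writes block (ib, kb). The function names and the grouping into "waves" are our own illustrative choices; a real scheduler would release tasks as soon as they become ready, and would also need to serialize (or make atomic) different updates to the same block, which is not modelled here.

```python
def block_cholesky_task_dag(nb):
    """Map each task to the set of tasks it depends on (1-based block indices)."""
    deps = {}
    for jb in range(1, nb + 1):
        deps[("factorize", jb)] = {("update", jb, kb, jb) for kb in range(1, jb)}
        for ib in range(jb + 1, nb + 1):
            deps[("solve", ib, jb)] = (
                {("update", ib, kb, jb) for kb in range(1, jb)}
                | {("factorize", jb)}
            )
            for kb in range(jb + 1, ib + 1):
                deps[("update", ib, jb, kb)] = {("solve", ib, jb),
                                                ("solve", kb, jb)}
    return deps


def execution_waves(deps):
    """Group tasks so that each task's dependencies lie in earlier waves."""
    done, waves = set(), []
    while len(done) < len(deps):
        wave = sorted(t for t in deps if t not in done and deps[t] <= done)
        if not wave:
            raise ValueError("dependency cycle")
        waves.append(wave)
        done.update(wave)
    return waves


waves = execution_waves(block_cholesky_task_dag(3))
```

For nb = 3 the tasks within a wave (for example, the two solve tasks of the first block column) are mutually independent and could run in parallel.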
ALGORITHM 5.3 In-place dense right-looking panel Cholesky factorization
Input: Dense SPD matrix $A$ in the form (1.2) with $nb$ panels.
Output: Factor $L$ such that $A = LL^T$.

1: for $jb = 1 : nb$ do
2:    $L_{jb:nb,jb} = A_{jb:nb,jb}$
3: end for
4: for $jb = 1 : nb$ do
5:    Compute in-place factorization of $L_{jb,jb}$    ▷ Overwrite $L_{jb,jb}$ with its Cholesky factor
6:    $L_{jb+1:nb,jb} = L_{jb+1:nb,jb} L_{jb,jb}^{-T}$    ▷ Dense triangular solve
7:    for $kb = jb+1 : nb$ do
8:       $L_{kb:nb,kb} = L_{kb:nb,kb} - L_{kb:nb,jb} L_{kb,jb}^T$
9:    end for
10: end for


ALGORITHM 5.4 In-place dense right-looking block Cholesky factorization
Input: Dense SPD matrix $A$ in the form (1.2) with $nb \times nb$ blocks.
Output: Factor $L$ such that $A = LL^T$.

1: for $jb = 1 : nb$ do
2:    $L_{jb:nb,jb} = A_{jb:nb,jb}$
3: end for
4: for $jb = 1 : nb$ do
5:    Compute in-place factorization of $L_{jb,jb}$    ▷ Task factorize(jb)
6:    for $ib = jb+1 : nb$ do
7:       $L_{ib,jb} = L_{ib,jb} L_{jb,jb}^{-T}$    ▷ Task solve(ib, jb)
8:       for $kb = jb+1 : ib$ do
9:          $L_{ib,kb} = L_{ib,kb} - L_{ib,jb} L_{kb,jb}^T$    ▷ Task update(ib, jb, kb)
10:      end for
11:   end for
12: end for


A dependency graph can be used to schedule the tasks. Its vertices correspond to 
tasks and dependencies between tasks are represented as directed edges. The result 
is a directed acyclic graph (DAG). A task is ready for execution if and only if all 
tasks with incoming edges to it have completed. DAG-driven linear algebra uses 
either a static or dynamic schedule based on these graphs to implement the tasks 
in a parallel environment. In practice, it is not necessary to explicitly compute the 
task DAG: it can be constructed on-the-fly taking into account the dependencies.
The task DAG allows a lot of flexibility in the order in which tasks are carried out: 
the left- and right-looking approaches correspond to particular restricted orderings 
of the tasks. 


5.2 Introduction to Sparse Cholesky Factorizations 


There are several classes of algorithms that implement sparse Cholesky factor- 
izations. Their major differences relate to how they schedule the computations. 
This affects the use of dense kernels, the amount of memory required during the 
factorization as well as the potential for parallel implementations. As in the dense 
case, the factorization is split into tasks that involve computations on and between 
dense submatrices and the precedence relations among them can be captured by a 
task graph. 

We start by extending the dense Cholesky factorizations to the sparse case 
in a straightforward way. In practice, it is essential for efficiency to exploit the 
supervariables of A and the supernodes of L. Thus, while for simplicity of the 
descriptions and notation, we refer to rows and columns of A and L, these typically 
represent block rows and block columns and, as in the above discussion of the dense 
block factorization algorithm, the entries of A and L are then submatrices. 

The entries of $L$ satisfy the relationship

$$L_{j+1:n,j} = \Big( A_{j+1:n,j} - \sum_{k=1}^{j-1} L_{j+1:n,k}\, l_{jk} \Big) \Big/ l_{jj} \quad \text{with} \quad l_{jj} = \Big( a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2 \Big)^{1/2},$$


and from this we deduce the following result. 


Theorem 5.1 (Liu 1990) The numerical values of the entries in column $j > k$ of
$L$ depend on the numerical values in column $k$ of $L$ if and only if $l_{jk} \neq 0$.


The theoretical background of the previous chapter based on the elimination 
tree $\mathcal{T}$ enables the dependencies in Theorem 5.1 to be searched for efficiently. In
particular, $\mathcal{T}$ allows the row (or column) counts of $L$ to be computed and they can
be used to allocate storage for L. It can also be used to find supernodes and the 
resulting (block) elimination tree can then be employed to determine the (block) 
column structure of L. In practice, it can be beneficial to split large supernodes into 
smaller panels to better conform to computer caches. 

Algorithms 5.5 and 5.6 are simplified sparse left- and right-looking Cholesky 
factorization algorithms that are straightforward sparse variants of Algorithms 5.1 
and 5.4, respectively (the latter with nb = n, that is, without considering blocks). 
Here, we assume that the sparsity pattern of L has already been determined in 
the symbolic phase and static storage formats based, for example, on compressed 
columns and/or rows are used. 
ALGORITHM 5.5 Simplified sparse left-looking Cholesky factorization
Input: SPD matrix $A$ and sparsity pattern $\mathcal{S}\{L\}$.
Output: Factor $L$ such that $A = LL^T$.

1: $l_{ij} = a_{ij}$ for all $(i, j) \in \mathcal{S}\{L\}$    ▷ Filled entries in $L$ are initialised to zero
2: for $j = 1 : n$ do
3:    for $k \in \{k < j \mid l_{jk} \neq 0\}$ do
4:       for $i \in \{i \ge j \mid l_{ik} \neq 0\}$ do
5:          $l_{ij} = l_{ij} - l_{ik} l_{jk}$
6:       end for
7:    end for
8:    $l_{jj} = (l_{jj})^{1/2}$
9:    for $i \in \{i > j \mid l_{ij} \neq 0\}$ do
10:      $l_{ij} = l_{ij} / l_{jj}$
11:   end for
12: end for
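
A possible Python rendering of Algorithm 5.5 is sketched below. The storage scheme is an assumption made for brevity: $A$ is passed as a dictionary of its lower triangular entries, the pattern $\mathcal{S}\{L\}$ as one sorted list of row indices per column, and each column of $L$ is held as a dictionary. Indices are 0-based and the function name is ours; production codes use compressed column storage instead.

```python
import math


def sparse_left_looking_cholesky(a, pattern):
    """a: {(i, j): a_ij} for i >= j; pattern[j]: sorted rows i >= j of S{L}.

    Returns cols, where cols[j] = {i: l_ij}. The pattern must include fill-in.
    """
    n = len(pattern)
    # Step 1: copy A into the pattern of L; filled entries start at zero.
    cols = [{i: a.get((i, j), 0.0) for i in pattern[j]} for j in range(n)]
    # row_cols[j] lists the columns k < j for which l_jk is in the pattern.
    row_cols = [[] for _ in range(n)]
    for k in range(n):
        for i in pattern[k]:
            if i > k:
                row_cols[i].append(k)
    for j in range(n):
        # Update column j using every earlier column k with l_jk != 0.
        for k in row_cols[j]:
            ljk = cols[k][j]
            for i, lik in cols[k].items():
                if i >= j:
                    cols[j][i] -= lik * ljk
        cols[j][j] = math.sqrt(cols[j][j])
        for i in pattern[j]:
            if i > j:
                cols[j][i] /= cols[j][j]
    return cols


a = {(0, 0): 4.0, (1, 0): 2.0, (2, 0): 2.0, (1, 1): 5.0, (2, 2): 5.25}
pattern = [[0, 1, 2], [1, 2], [2]]   # entry (2, 1) is fill-in
cols = sparse_left_looking_cholesky(a, pattern)
```

If the supplied pattern omits a filled entry, the update in the innermost loop fails with a `KeyError`, which reflects the requirement that the symbolic phase has already determined $\mathcal{S}\{L\}$.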


ALGORITHM 5.6 Simplified sparse right-looking Cholesky factorization
Input: SPD matrix $A$ and sparsity pattern $\mathcal{S}\{L\}$.
Output: Factor $L$ such that $A = LL^T$.

1: $l_{ij} = a_{ij}$ for all $(i, j) \in \mathcal{S}\{L\}$    ▷ Filled entries in $L$ are initialised to zero
2: for $j = 1 : n$ do
3:    $l_{jj} = (l_{jj})^{1/2}$
4:    for $i \in \{i > j \mid l_{ij} \neq 0\}$ do
5:       $l_{ij} = l_{ij} / l_{jj}$
6:    end for
7:    for $k \in \{k > j \mid l_{kj} \neq 0\}$ do
8:       for $i \in \{i \ge k \mid l_{ij} \neq 0\}$ do
9:          $l_{ik} = l_{ik} - l_{ij} l_{kj}$
10:      end for
11:   end for
12: end for


An alternative for sparse matrices held in row-wise format is to compute L one 
row at a time. This is sometimes called an up-looking factorization because rows 
1 to $i-1$ are employed to compute row $i$ ($i > 1$). The approach is asymptotically
optimal in the work performed and for highly sparse matrices it is potentially 
extremely efficient because the entries of A are used in the natural order in which 
they are stored. However, it is difficult to incorporate high level BLAS. 
The following relation holds for the $i$-th row of $L$

$$L_{i,1:i-1}^T = L_{1:i-1,1:i-1}^{-1} A_{1:i-1,i} \quad \text{with} \quad l_{ii}^2 = a_{ii} - L_{i,1:i-1} L_{i,1:i-1}^T.$$

The application of $L_{1:i-1,1:i-1}^{-1}$ can be implemented by solving the triangular system

$$L_{1:i-1,1:i-1}\, y = A_{1:i-1,i}$$

and setting $L_{i,1:i-1}^T = y$. The following result can be used to determine the sparsity
pattern of $y$.


Theorem 5.2 (Gilbert 1994) Consider a sparse lower triangular matrix $L$ and the
DAG $\mathcal{G}(L^T)$ with vertex set $\{1, 2, \ldots, n\}$ and edge set $\{(j \rightarrow i) \mid l_{ij} \neq 0\}$. The
sparsity pattern $\mathcal{S}\{y\}$ of the solution $y$ of the system $Ly = b$ is the set of all vertices
reachable in $\mathcal{G}(L^T)$ from $\mathcal{S}\{b\}$.


Proof From Algorithm 3.4 and under the non-cancellation assumption, we see
that (a) if $b_i \neq 0$, then $y_i \neq 0$ and (b) if for some $j < i$, $y_j \neq 0$ and $l_{ij} \neq 0$, then
$y_i \neq 0$. These two conditions can be expressed as a graph traversal problem in
$\mathcal{G}(L^T)$. (a) adds all vertices in $\mathcal{S}\{b\}$ to the set of visited vertices and (b) states that
if vertex $j$ has been visited, then all its neighbours in $\mathcal{G}(L^T)$ are added to the set of
visited vertices. Thus $\mathcal{S}\{y\} = \mathrm{Reach}(\mathcal{S}\{b\}) \cup \mathcal{S}\{b\}$. □


Figure 5.1 illustrates the sparsity patterns of a lower triangular matrix $L$ and
vector $b$ together with $\mathcal{G}(L^T)$. The vertices that are reachable from $\mathcal{S}\{b\} = \{2, 4\}$
are 5 and 6 and thus $\mathcal{S}\{y\} = \{2, 4, 5, 6\}$.
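
The reachability result translates directly into a sparse triangular solve. The sketch below finds the reachable set by a depth-first search and then performs a column-oriented forward substitution in topological order. The pattern used in the example is hypothetical: it is chosen so that $\mathcal{S}\{b\} = \{2, 4\}$ gives $\mathcal{S}\{y\} = \{2, 4, 5, 6\}$, but it is not claimed to be the matrix of Figure 5.1. The recursive search is used for brevity; for large matrices an explicit stack would be preferable.

```python
def reach(col_rows, start):
    """Vertices reachable from `start` in G(L^T), in topological order.

    col_rows[j] lists the rows i > j with l_ij != 0, i.e. the edges j -> i.
    """
    visited, postorder = set(), []

    def dfs(j):
        visited.add(j)
        for i in col_rows[j]:
            if i not in visited:
                dfs(i)
        postorder.append(j)

    for j in sorted(start):
        if j not in visited:
            dfs(j)
    return postorder[::-1]   # reverse postorder is a topological order


def sparse_lower_solve(cols, b):
    """Solve L y = b with cols[j] = {i: l_ij} (i >= j) and b = {i: b_i}."""
    col_rows = {j: [i for i in col if i > j] for j, col in cols.items()}
    y = dict(b)
    for j in reach(col_rows, b):
        # All contributions to y_j have been applied; finish it and
        # propagate it down column j.
        y[j] = y.get(j, 0.0) / cols[j][j]
        for i in col_rows[j]:
            y[i] = y.get(i, 0.0) - cols[j][i] * y[j]
    return y


# Hypothetical 6 x 6 unit-diagonal example (1-based labels).
cols = {1: {1: 1.0, 3: 1.0}, 2: {2: 1.0, 5: 2.0}, 3: {3: 1.0},
        4: {4: 1.0, 6: 3.0}, 5: {5: 1.0, 6: 1.0}, 6: {6: 1.0}}
y = sparse_lower_solve(cols, {2: 1.0, 4: 1.0})
```

Only the columns in the reachable set are touched, so the work is proportional to the number of entries in those columns rather than to $n$.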

Algorithm 5.7 outlines a sparse row Cholesky factorization that is based on the
repeated solution of triangular linear systems. Theorem 5.2 can be used to determine
the sparsity pattern of row $i$ at Step 3, that is, by finding all the vertices that are
reachable in $\mathcal{G}(L_{1:i-1,1:i-1}^T)$ from the set $\{j \mid a_{ij} \neq 0,\ j < i\}$. A depth-first search
Figure 5.1 An example to illustrate $L$, $b$ and $\mathcal{G}(L^T)$.
ALGORITHM 5.7 Sparse up-looking Cholesky factorization
Input: SPD matrix $A$.
Output: Factor $L$ such that $A = LL^T$.

1: $l_{11} = (a_{11})^{1/2}$
2: for $i = 2 : n$ do
3:    Find $\mathcal{S}\{L_{i,1:i-1}\}$    ▷ Sparsity pattern of row $i$
4:    $L_{i,1:i-1}^T = L_{1:i-1,1:i-1}^{-1} A_{1:i-1,i}$    ▷ Sparse triangular solve
5:    $l_{ii} = a_{ii} - L_{i,1:i-1} L_{i,1:i-1}^T$
6:    $l_{ii} = (l_{ii})^{1/2}$
7: end for


of $\mathcal{G}(L_{1:i-1,1:i-1}^T)$ determines the vertices in the row sparsity patterns in topological
order, and performing the numerical solves in that order correctly preserves the
numerical dependencies. Alternatively, because nonzeros of $L_{i,1:i-1}$ correspond to
the vertices in the $i$-th row subtree $\mathcal{T}_r(i)$ that are not equal to $i$, another option is to
find the row subtrees using $\mathcal{T}(A)$.
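
The row-by-row structure of Algorithm 5.7 can be seen in the following sketch, which holds both $A$ and $L$ by rows as dictionaries (an assumed storage scheme, with 0-based indices and a function name of our choosing). To keep it short, the triangular solve loops over all $j < i$ instead of visiting only the reachable set or the row subtree, so the sketch illustrates the data flow but does not achieve the optimal work bound.

```python
import math


def up_looking_cholesky(a):
    """a[i] = {j: a_ij for j <= i}; returns rows with rows[i] = {j: l_ij}."""
    n = len(a)
    rows = []
    for i in range(n):
        y = {}
        # Forward substitution with rows 0..i-1 of L:
        # y_j = (a_ij - sum_{k<j} l_jk y_k) / l_jj.
        for j in range(i):
            s = a[i].get(j, 0.0)
            for k, ljk in rows[j].items():
                if k < j and k in y:
                    s -= ljk * y[k]
            if s != 0.0:
                y[j] = s / rows[j][j]
        # Diagonal entry: l_ii^2 = a_ii - (row i of L) . (row i of L).
        d = a[i][i] - sum(v * v for v in y.values())
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        y[i] = math.sqrt(d)
        rows.append(y)
    return rows


rows = up_looking_cholesky([{0: 4.0}, {0: 2.0, 1: 5.0}, {0: 2.0, 2: 5.25}])
```

Row $i$ is computed using only rows that are already complete, and the entries of $A$ are read one row at a time, which is what makes the approach attractive for row-wise storage.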


5.3 Supernodal Sparse Cholesky Factorizations 


The simplified schemes form the basis of sophisticated supernodal algorithms that 
are designed to be efficient in parallel computational environments. Consider the 
right-looking variant and recall that a supernode consists of one or more consecutive 
columns of L with the same sparsity pattern. These nonzeros are stored as a dense 
trapezoidal matrix (only the lower triangular part of the block on the diagonal needs 
to be stored and the rows of zeros in the columns of the supernode are not held). 
This is termed a nodal matrix (see Figure 5.2). 

Once a supernode is ready to be factorized, a dense Cholesky factorization of the 
block on the diagonal of the nodal matrix is performed (one of the approaches of 
Section 5.1 can be used). Then a triangular solve is performed with the computed 
factor and the rectangular part of the nodal matrix. The next step is to iterate over 
ancestors of the supernode in the assembly tree. For each parent, the rows of the 
current supernode corresponding to the parent’s columns are identified, and then
the outer product of those rows and the part of the supernode below those columns
is formed (update operations). The resulting matrix can be held in a temporary buffer.
The rows and columns of this buffer are matched against indices of the ancestors 
and are added to them in a sparse scatter operation. For efficiency, the updates may 
use panels so that the temporary buffer remains in cache. 
Figure 5.2 An illustration of a supernode (left), the corresponding nodal matrix (centre), and the
nodal matrix with two panels (right). The shaded lower triangular part of the block on the diagonal 
and the shaded block rows are treated as dense. 


5.3.1 DAG-Based Approach 


The DAG-based approach can also be extended to the sparse case. Each nodal matrix 
is subdivided into blocks. The factorization is split into tasks in which a single block 
is revised. The key difference compared to the dense case is that it is necessary to 
distinguish between two types of update operations: update_internal performs the 
update between blocks in the same nodal matrix and update_between performs 
the update when the blocks belong to different nodal matrices. Thus the sparse 
Cholesky factorization is split into the following tasks; the first two are illustrated 
in Figure 5.3. In this example, the nodal matrix has two block columns that do not 
contain the same number of columns. 


factorize_block($L_{diag}$) Computes the dense Cholesky factor $L_{diag}$ of the block
on the diagonal (leftmost plot). If the block is trapezoidal, the factorization is
followed by a triangular solve of its rectangular part $L_{rect} = L_{rect} L_{diag}^{-T}$ (centre
plot).

solve_block($L_{dest}$) Performs a triangular solve of an off-diagonal block $L_{dest}$ of
the form $L_{dest} = L_{dest} L_{diag}^{-T}$ (rightmost plot).

update_internal($L_{dest}$, $L_r$, $L_c$) Performs the update $L_{dest} = L_{dest} - L_r L_c^T$,
where $L_{dest}$, $L_r$ and $L_c$ belong to the same nodal matrix.

update_between($L_{dest}$, $L_r$, $L_c$) Performs the update $L_{dest} = L_{dest} - L_r L_c^T$,
where $L_r$ and $L_c$ belong to the same nodal matrix and $L_{dest}$ belongs to a different
nodal matrix.


Again, the tasks are partially ordered and a task DAG is used to capture the 
dependencies. For example, the updating of a block of a nodal matrix from a block 
Figure 5.3 An illustration of a blocked nodal matrix with two block columns. The first block on
the diagonal is triangular and the second one is trapezoidal. The task factorize_block is illustrated 
on the left and in the centre; the task solve_block is illustrated on the right. 


column of L that is associated with a descendant of the supernode has to wait 
until all the relevant rows of the block column are available. At each stage of the 
factorization, tasks will be executing (in parallel) while others are held (in a stack 
or pool of tasks) ready for execution. 


5.4 Multifrontal Method 


The multifrontal method is an alternative way to compute a sparse Cholesky 
factorization. To discuss this popular approach, we use the following result that 
determines which rows and columns influence particular Schur complements using 
the terminology of the elimination tree. 


Theorem 5.3 (Liu 1990) Let $A$ be SPD and let $\mathcal{T}$ be its elimination tree. The
numerical values of entries in column $k$ of the Cholesky factor $L$ of $A$ only affect the
numerical values of entries in column $i$ of $L$ for $i \in anc_{\mathcal{T}}\{k\}$ ($1 \le k < i \le n-1$).


Proof From (4.1), setting $S^{(1)} = A$, for $k \ge 2$ the $(n-k+1) \times (n-k+1)$ Schur
complement $S^{(k)}$ can be expressed as

$$S^{(k)} = S^{(k-1)}_{k:n,k:n} - \begin{pmatrix} l_{k,k-1} \\ \vdots \\ l_{n,k-1} \end{pmatrix} \begin{pmatrix} l_{k,k-1} & \cdots & l_{n,k-1} \end{pmatrix} = A_{k:n,k:n} - \sum_{j=1}^{k-1} L_{k:n,j} L_{k:n,j}^T. \tag{5.1}$$

Theorem 4.2 implies that all nonzero off-diagonal entries $l_{ik}$ in column $k$ of $L$
explicitly used in the update (5.1) are such that $i \in anc_{\mathcal{T}}\{k\}$. Considering the
Cholesky factorization as a sequence of Schur complement updates, only columns $i$
with $i \in anc_{\mathcal{T}}\{k\}$ can be influenced numerically by the Schur complement update
in the $k$-th step of the factorization, and the result follows. □


The computation of subsequent Schur complements by adding individual updates 
as in (5.1) is straightforward; the multifrontal method employs further modifications 
and enhancements of this basic concept. First, because the vertices of $\mathcal{T}$ are
topologically ordered, the order in which the updates are applied progresses up the
tree from the leaf vertices to the root vertex. This allows the computation of $S^{(k)}$ to
be rewritten as

$$S^{(k)} = A_{k:n,k:n} - \sum_{j \in \mathcal{T}(k) \setminus \{k\}} L_{k:n,j} L_{k:n,j}^T,$$


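emphasizing the role of $\mathcal{T}$. In place of Schur complements, the multifrontal method
uses frontal matrices connected to subtrees of $\mathcal{T}$. Assume $k, k_1, \ldots, k_r$ are the row
indices of the nonzeros in column $k$ of $L$. The frontal matrix $F_k$ of the $k$-th subtree
$\mathcal{T}(k)$ of $\mathcal{T}$ is the dense $(r+1) \times (r+1)$ matrix defined by

$$F_k = \begin{pmatrix} a_{kk} & a_{kk_1} & \cdots & a_{kk_r} \\ a_{k_1k} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{k_rk} & 0 & \cdots & 0 \end{pmatrix} - \sum_{j \in \mathcal{T}(k) \setminus \{k\}} \begin{pmatrix} l_{kj} \\ l_{k_1j} \\ \vdots \\ l_{k_rj} \end{pmatrix} \begin{pmatrix} l_{kj} & l_{k_1j} & \cdots & l_{k_rj} \end{pmatrix}. \tag{5.2}$$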


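One step of the Cholesky factorization of $F_k$ can be written as

$$F_k = \begin{pmatrix} l_{kk} & 0 \cdots 0 \\ \begin{matrix} l_{k_1k} \\ \vdots \\ l_{k_rk} \end{matrix} & I \end{pmatrix} \begin{pmatrix} 1 & \\ & V_k \end{pmatrix} \begin{pmatrix} l_{kk} & l_{k_1k} \cdots l_{k_rk} \\ \begin{matrix} 0 \\ \vdots \\ 0 \end{matrix} & I \end{pmatrix} \tag{5.3}$$

$$\phantom{F_k} = \begin{pmatrix} l_{kk} \\ l_{k_1k} \\ \vdots \\ l_{k_rk} \end{pmatrix} \begin{pmatrix} l_{kk} & l_{k_1k} & \cdots & l_{k_rk} \end{pmatrix} + \begin{pmatrix} 0 & \\ & V_k \end{pmatrix}, \tag{5.4}$$

where $V_k$ is termed a generated element (it is also sometimes called an update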
matrix or a contribution block). The name “generated element” is because the 
multifrontal method has its origins in the simpler frontal method, which uses a 
single frontal matrix. The frontal method was originally proposed for finite element
problems to avoid the need to explicitly construct the system
matrix $A$; it was later generalized to non-element problems. It works with a single
frontal matrix and has less scope for parallelisation compared to the multifrontal 
method; it is no longer widely used. 
Equating the last $r$ rows and columns in (5.2) and (5.4) yields

$$V_k = - \sum_{j \in \mathcal{T}(k)} \begin{pmatrix} l_{k_1j} \\ \vdots \\ l_{k_rj} \end{pmatrix} \begin{pmatrix} l_{k_1j} & \cdots & l_{k_rj} \end{pmatrix}. \tag{5.5}$$

Assume that $c_j$ ($j = 1, \ldots, s$) are the children of $k$ in $\mathcal{T}$. The set $\mathcal{T}(k) \setminus \{k\}$ is
the union of disjoint sets of vertices in the subtrees $\mathcal{T}(c_j)$. Each of these subtrees
is represented in the overall update by the generated element (5.5). Thus, $F_k$ can
be written in a recursive form using the generated elements of the children of $k$ as
follows


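$$F_k = \begin{pmatrix} a_{kk} & a_{kk_1} & \cdots & a_{kk_r} \\ a_{k_1k} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{k_rk} & 0 & \cdots & 0 \end{pmatrix} \oplus V_{c_1} \oplus \cdots \oplus V_{c_s}. \tag{5.6}$$


Here, the operation $\oplus$ denotes the addition of matrices that have row and column
indices belonging to subsets of the same set of indices (in this case, $k, k_1, \ldots, k_r$);
entries that have the same row and column indices are summed. This is referred to
as the extend-add operator.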
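
A small sketch may help to fix the idea of the extend-add operator. Each matrix is represented here by a sorted list of global indices together with a dense list-of-rows array; this representation and the function name are assumptions made for illustration only.

```python
def extend_add(idx_a, mat_a, idx_b, mat_b):
    """Extend-add of two dense matrices labelled by global index lists.

    Returns the merged index list and the summed dense matrix.
    """
    idx = sorted(set(idx_a) | set(idx_b))
    pos = {g: p for p, g in enumerate(idx)}   # global index -> local position
    out = [[0.0] * len(idx) for _ in idx]
    for src_idx, src in ((idx_a, mat_a), (idx_b, mat_b)):
        for r, gi in enumerate(src_idx):
            for c, gj in enumerate(src_idx):
                # Entries with the same global row and column are summed.
                out[pos[gi]][pos[gj]] += src[r][c]
    return idx, out


idx, out = extend_add([5, 8], [[1.0, 2.0], [2.0, 3.0]],
                      [5, 9], [[10.0, 20.0], [20.0, 30.0]])
```

Here the two operands share only the global index 5, so only the (5, 5) entries are added together; all other entries are copied into their positions in the extended matrix.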

Adding a row and column of A and the generated elements into a frontal matrix 
is called the assembly. A variable is fully summed if it is not involved in any rows 
and columns of A that have still to be assembled or in a generated element. Once 
a variable is fully summed, it can be eliminated. A key feature of the multifrontal 
method is that the frontal matrices and the generated elements are compressed and 
stored without zero rows and columns as small dense matrices. Integer arrays are 
used to maintain a mapping of the local contiguous indices of the frontal matrices 
to the global indices of A and its factors. Symmetry allows only the lower triangular 
part of these matrices to be held. Algorithm 5.8 outlines the basic multifrontal 
method. 


ALGORITHM 5.8 Basic multifrontal Cholesky factorization
Input: SPD matrix $A$ and its elimination tree.
Output: Factor $L$ such that $A = LL^T$.

1: for $k = 1 : n$ do
2:    Assemble the frontal matrix $F_k$ using (5.6)    ▷ Only the lower triangle is needed
3:    Perform a partial Cholesky factorization of $F_k$ using (5.3) to obtain column $k$ of $L$ and the generated element $V_k$
4: end for
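
Algorithm 5.8 can be sketched compactly if efficiency is set aside. The sketch below takes a dense list-of-rows matrix (whose zeros define the sparsity), stores full rather than lower triangular frontal matrices, and does not take the elimination tree as input: it uses the fact that the parent of $k$ is the first index $k_1$ in the index list of $V_k$, so each generated element is simply queued for that vertex. The names `pending` and `multifrontal_cholesky` are ours.

```python
import math


def multifrontal_cholesky(a):
    """Basic multifrontal Cholesky of a dense-stored SPD matrix (0-based)."""
    n = len(a)
    fac = [[0.0] * n for _ in range(n)]
    pending = [[] for _ in range(n)]   # generated elements awaiting parent k
    for k in range(n):
        # Index list: k, the nonzeros below a_kk, and the children's indices.
        idx = {k} | {i for i in range(k + 1, n) if a[i][k] != 0.0}
        for child_idx, _ in pending[k]:
            idx |= set(child_idx)
        idx = sorted(idx)
        pos = {g: p for p, g in enumerate(idx)}
        m = len(idx)
        # Assembly: first row and column from A, then extend-add the
        # generated elements of the children, as in (5.6).
        front = [[0.0] * m for _ in range(m)]
        for g in idx:
            front[pos[g]][0] = a[g][k]
            front[0][pos[g]] = a[g][k]
        for child_idx, v in pending[k]:
            for r, gi in enumerate(child_idx):
                for c, gj in enumerate(child_idx):
                    front[pos[gi]][pos[gj]] += v[r][c]
        # Partial factorization: one elimination gives column k of L.
        lkk = math.sqrt(front[0][0])
        col = [lkk] + [front[p][0] / lkk for p in range(1, m)]
        for p, g in enumerate(idx):
            fac[g][k] = col[p]
        if m > 1:
            v = [[front[p][q] - col[p] * col[q] for q in range(1, m)]
                 for p in range(1, m)]
            pending[idx[1]].append((idx[1:], v))   # idx[1] is the parent of k
    return fac


L = multifrontal_cholesky([[4.0, 2.0, 2.0],
                           [2.0, 5.0, 0.0],
                           [2.0, 0.0, 5.25]])
```

Each generated element is appended to exactly one queue and consumed exactly once, when its parent's frontal matrix is assembled.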
ALGORITHM 5.9 Multifrontal Cholesky factorization using the assembly tree
Input: SPD matrix $A$ and its assembly tree.
Output: Factor $L$ such that $A = LL^T$.

1: $nelim = 0$    ▷ $nelim$ is the number of eliminations performed
2: for $kb = 1 : nsup$ do    ▷ $nsup$ is the number of supernodes
3:    Assemble the frontal matrix $F_{kb}$; let $\ell$ be the number of fully summed variables in $F_{kb}$
4:    Perform a block partial Cholesky factorization of $F_{kb}$ to obtain columns $nelim + 1$ to $nelim + \ell$ of $L$ and the generated element $V_{kb}$
5:    $nelim = nelim + \ell$
6: end for


We have the following observation. 


Observation 5.1 Each generated element $V_k$ is used only once to contribute to a
frontal matrix $F_{parent(k)}$. Furthermore, the index list for the frontal matrix $F_k$ is the
set of row indices of the nonzeros in column $k$ of the Cholesky factor $L$.


In practical implementations, efficiency is improved by using the assembly tree 
(Section 4.6) because it allows more than one elimination to be performed at once. 
This is outlined in Algorithm 5.9. Here kb is used to index the frontal matrix on the 
kb-th step ($1 \le kb \le nsup$).

As an example, consider the matrix and its assembly tree given in Figure 4.10. 
The nsup = 5 supernodes are {1, 2}, 3, 4, 5, {6, 7, 8, 9} and so variables 1 and 2 can 
be eliminated together on the first step. Assembling rows/columns 1 and 2 of the 
original matrix, the frontal matrix $F_1$ and generated element $V_1$ have the structure

[Display: sparsity patterns of $F_1$ (rows and columns 1, 2, 8, 9) and $V_1$ (rows and columns 8, 9)]


where f denotes fill-in entries (only the lower triangular entries are stored in 
practice). Similarly, 


The frontal matrix $F_3$ and generated element $V_3$ are given by

[Display: sparsity patterns of $F_3$ and $V_3$]

Then

[Display: sparsity patterns of the subsequent frontal matrices and generated elements]
An important implementation detail is how and where to store the generated
elements. The partial factorization of $F_{kb}$ at supernode kb can be performed once
the partial factorizations at all the vertices belonging to the subtree of the assembly 
tree with root vertex kb are complete. If the vertices of the assembly tree are ordered 
using a depth-first search, the generated elements required at each stage are the 
most recently computed ones amongst those that have not yet been assembled. This 
makes it convenient to use a stack. This affects the order in which the variables are 
eliminated but in exact arithmetic, the results are identical. 

Nevertheless, the memory demands of the multifrontal method can be very large. 
Not only is it dependent on the initial ordering of A but the ordering of the children 
of a vertex in the assembly tree can significantly affect the required stack size. Some 
implementations target limiting stack storage requirements. An attractive feature of 
the multifrontal method is that the generated elements can be held using auxiliary 
storage (in files on disk) to restrict the in-core memory requirements, allowing larger 
problems to be solved than would otherwise be possible. 
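
The effect of the child ordering on the stack can be illustrated with a simple storage model. The model below is an assumption made for illustration, not a description of any particular solver: the children of a vertex are processed in the given order, each leaves its generated element (contribution block) on the stack, and the frontal matrix of the parent is allocated while all of these are still stacked. The sizes in the example are invented.

```python
def peak_stack_storage(tree, front, cb, v):
    """Peak storage needed to process the subtree rooted at v.

    tree[v]  : children of v, in the order in which they are processed
    front[v] : storage for the frontal matrix of v
    cb[v]    : storage for the generated element of v
    """
    stacked = 0   # generated elements of the children processed so far
    peak = 0
    for c in tree.get(v, []):
        peak = max(peak, stacked + peak_stack_storage(tree, front, cb, c))
        stacked += cb[c]
    # The frontal matrix of v coexists with all its children's elements.
    return max(peak, stacked + front[v])


front = {"root": 10, "a": 100, "b": 60}
cb = {"root": 0, "a": 1, "b": 50}
peak_ab = peak_stack_storage({"root": ["a", "b"]}, front, cb, "root")
peak_ba = peak_stack_storage({"root": ["b", "a"]}, front, cb, "root")
```

In this model, processing the child with the large frontal matrix and small generated element first needs 100 units, whereas the opposite order needs 150 units, because the large generated element of `b` sits on the stack while `a` is processed.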


5.5 Parallelism Within Sparse Cholesky Factorizations 


Sparse Cholesky factorizations use supernodes and task graphs (the assembly tree 
for the multifrontal method) to control the computation. The number of rows and 
columns in a supernode typically increases away from the leaf vertices and towards 
the root of the task graph because a supernode accumulates fill-in from its descendants
in the task graph. As a result, tasks that are relatively close to the root tend to 
have more work associated with them. On the other hand, the width of the task 
graph shrinks close to the root. In other words, a typical task graph for sparse 
matrix factorization tends to have a large number of small independent tasks close 
to the leaf vertices, but a small number of large tasks close to the root. An ideal 
parallelization strategy that would match the characteristics of the problem is as 
follows. Initially, assign the relatively plentiful independent tasks at or near the leaf 
vertices to parallel threads or processes. This is called task or tree level parallelism; 
it is influenced by the ordering of A. As tasks complete, other tasks become available 
and are scheduled similarly. This continues while there are enough independent 
tasks to keep all the threads or processes busy. When the number of available parallel
tasks becomes too small, the only way to keep all the threads or processes busy is to
assign more than one of them to a task. This is termed node level parallelism. The number of threads
or processes working on individual tasks should increase as the number of parallel 
tasks decreases. Eventually, all threads or processes are available to work on the 
root task. The computation corresponding to the root task is equivalent to factoring 
a dense matrix of the size of the root supernode. 

The multifrontal method is often the formulation of choice for highly parallel 
implementations of sparse matrix factorizations. This is because of its natural data 
locality (most of the work of the factorization is performed in the dense frontal 
matrices) and the ease of synchronization that it permits. In general, each supernode 
is updated by multiple other supernodes and it can potentially update many other 
supernodes during the course of the factorization. If implemented naively, all these 
updates may require excessive locking and synchronization in a shared-memory 
environment or generate excessive message-traffic in a distributed environment. In 
the multifrontal method, the updates are accumulated and channelled along the paths 
from the leaf vertices of the assembly tree to its root vertex. This gives a manageable 
structure to the potentially haphazard interaction among the tasks. 

In Section 1.2.4, bit compatibility was discussed. While different orderings of the 
children of a vertex in the assembly tree do not affect the total number of floating- 
point operations that are performed in the multifrontal method, in finite-precision 
arithmetic changing the order of the assemblies into the frontal matrices can lead to 
slightly different results. Given that the number of children is typically small and 
that large matrices can be partitioned such that summations can be safely performed 
in parallel, the overhead in the multifrontal method of enforcing a defined order of 
the summation is relatively small. By contrast, in the supernodal approach, for each 
data block a number of matrices equal to the block dependencies are summed. Given 
the relatively large numbers (several thousand) for many nodes, an enforced order 
may be detrimental to efficiency. 
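
The sensitivity to the order of summation is easy to reproduce. In the following sketch, the three numbers stand in for contributions assembled into a single entry of a frontal matrix; the function name and the values are purely illustrative.

```python
def assemble(contributions):
    """Sum the contributions to one entry in the order given."""
    total = 0.0
    for c in contributions:
        total += c
    return total


order_1 = assemble([0.1, 0.2, 0.3])
order_2 = assemble([0.3, 0.2, 0.1])
results_differ = order_1 != order_2
```

In IEEE double precision the two results differ in the last bit, which is why bit-compatible results require the assembly order to be fixed, independently of the order in which parallel tasks happen to complete.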




5.6 Notes and References 


Exploiting panels and blocks in both left- and right-looking Cholesky factorization 
algorithms is extremely important. The development of sparse supernodal factor- 
izations for uniprocessors and multiprocessors in the 1990s is discussed by Ng 
& Peyton (1993a,b); Rothberg & Gupta (1993) presents an early comparison of 
various types of block Cholesky factorizations. PaStiX of Hénon et al. (2002) is a 
parallel left-looking supernodal solver that is primarily designed for positive definite 
systems. Rotkin & Toledo (2004) introduce a hybrid left-looking/right-looking 
algorithm and Rozin & Toledo (2005) show that no sparse numerical factorization is 
uniformly better than the others. An up-looking approach, which is fast in practice 
for very sparse matrices, is employed in the widely used CHOLMOD solver of Chen 
et al. (2008). The package HSL_MA87 implements a sparse DAG-based Cholesky 
factorization for shared-memory architectures; further details of the approach can 
be found in Hogg et al. (2010). 

The multifrontal algorithm has its origins in the simpler frontal method of Irons 
(1970), which was developed by the civil engineering community from the 1960s 
onwards to solve the linear systems that arise within finite element methods. At a 
time when the main memory of even the most powerful computers was extremely 
limited, the frontal method was heavily influenced by the need to minimize the 
memory requirements of the linear solver. It was initially designed for SPD banded 
linear systems and was subsequently extended to nonsymmetric problems by Hood 
(1976) and to the symmetric indefinite case by Reid (1981); Duff (1984) generalizes 
the approach to non-element problems. The frontal method proceeds by alternating 
the assembly of the finite elements into a single dense frontal matrix with the 
elimination and update of variables. Once variables have been eliminated they are 
no longer needed during the factorization and so they are removed from the frontal 
matrix and stored elsewhere (for example, not in main memory but on an external 
disk) until needed during the solve phase. This frees up space to accommodate the 
next element to be assembled. Because the frontal method does not use the assembly 
tree, the frontal matrix can be much larger than those in the multifrontal method, 
leading to higher operation counts but also allowing the use of BLAS with larger 
block sizes. Efficient implementations were developed up until the late 1990s, for
example, by Duff & Scott (1996, 1999), who provide a package MA62 for SPD
problems in element form that employs a single array of length n, exploits Level 3 
BLAS, and holds the computed factors on disk; a coarse-grained parallel version is 
also available, see Duff & Scott (1994) and Scott (2001). 

The frontal method and the work of Speelpenning (1978) on the so-called 
generalized element method led to the development by Duff & Reid (1983) of the 
multifrontal method for solving general symmetric systems (including systems in 
element form). A detailed matrix-based explanation is given in Liu (1992). The 
method is implemented in some of the most important sparse direct solvers. The 
MUMPS (2022) package, which has been actively developed over many years, 
provides a state-of-the-art distributed memory general-purpose multifrontal solver
that uses shared-memory parallelism within each MPI process. Other important
parallel multifrontal solvers are HSL_MA97 (Hogg & Scott, 2013b) and WSMP
(2020), while the serial package MA57 of Duff (2004) (which superseded the
original and perhaps most well-known multifrontal solver MA27 of Duff & Reid
(1983)) remains very popular. An attractive feature of HSL_MA97 is that it
computes bit-compatible solutions. HSL_MA77 of Reid & Scott (2009) is designed
to minimize memory requirements by allowing the factors and the multifrontal stack
to be efficiently held outside of main memory (an option that is also offered by
MUMPS). In common with earlier frontal solvers, HSL_MA77 allows the user to
input the system matrix in element form (that is, A is not explicitly assembled 
for problems coming from finite element applications but is input one element at 
a time). 

The use of GPUs is well-suited to a multifrontal or supernodal factorization 
because these approaches rely on regular block computations within dense subma- 
trices. Implementing the multifrontal method (including for symmetric indefinite 
matrices) on GPU architectures is discussed in Hogg et al. (2016), while Lacoste 
et al. (2012) and Rennich et al. (2016) present GPU-accelerated supernodal 
factorizations. Discussion of the use of GPUs within direct solvers is included in 
the comprehensive survey of Davis et al. (2016).


Chapter 6
Sparse LU Factorizations


The closer one looks, the more subtle and remarkable Gaussian 
elimination appears — Trefethen (1985) 


Gaussian elimination is living mathematics. It has mutated 
successfully for the last two hundred years to meet changing 
social needs — Grcar (2011) 


This chapter considers the LU factorization of a general nonsymmetric nonsingular 
sparse matrix A. In practice, numerical pivoting for stability and/or ordering of A to 
limit fill-in in the factors is often needed and the computed factorization is then of a 
permuted matrix PAQ. Pivoting is discussed in Chapter 7 and ordering algorithms
in Chapter 8. 


6.1 Sparse LU Factorizations and Their Graph Models 


In Chapter 4, graphs were used to describe structural changes during a sparse 
Cholesky factorization. In particular, the elimination tree was shown to play a key 
role and, in the previous chapter, the use of DAGs was discussed. For general 
matrices, there are a number of ways that graphs can be employed. 


6.1.1 Use of Elimination DAGs 


The first graph model uses the elimination DAGs associated with L and U that were 
defined in (2.1)-(2.2). The following observation, which is illustrated in Figure 6.1, 
generalizes Observation 4.1 to nonsymmetric matrices. 


Observation 6.1 If $i > j$ and $u_{ji} \neq 0$, then the column replication principle
states

$$\mathcal{S}\{L_{i:n,j}\} \subseteq \mathcal{S}\{L_{i:n,i}\},$$

that is, the pattern of column $j$ of $L$ (rows $i$ to $n$) is replicated in the pattern of
column $i$ of $L$. Analogously, if $i > j$ and $l_{ij} \neq 0$, then the row replication principle
states

$$\mathcal{S}\{U_{j,i:n}\} \subseteq \mathcal{S}\{U_{i,i:n}\},$$

that is, the pattern of row $j$ of $U$ (columns $i$ to $n$) is replicated in the pattern of row
$i$ of $U$.


Figure 6.1 An illustration of the column and row replication principles of sparse LU
factorizations. The matrix A is on the left. In the centre, we show in red the filled entries in L
resulting from the replication of the first column in the second column because $u_{12} \neq 0$. On the
right, we show in blue the filled entries in U resulting from the replication of the second row in
the third row because $l_{32} \neq 0$. Other filled entries resulting from subsequent steps of the
factorization are denoted in black.


Algorithm 6.1 outlines a basic sparse LU factorization. Here it is assumed that A 
is factorizable so that pivoting is not needed. The remainder of this chapter looks at 
techniques that can be used to develop the approach into an efficient one. 
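
The loop structure of Algorithm 6.1 can be sketched in Python as follows. This is a minimal illustration only: it assumes A is held as a dense list of lists (a practical code would use sparse storage and would not scan for nonzeros), that no pivoting is needed, and the function name `basic_sparse_lu` is hypothetical.

```python
def basic_sparse_lu(A):
    """Right-looking LU without pivoting, mirroring the loops of Algorithm 6.1.

    A is a square list of lists; returns (L, U) with unit lower triangular L.
    Dense storage is an assumption made purely to keep the sketch short.
    """
    n = len(A)
    # L = I + strictly lower triangular part of A
    L = [[1.0 if i == j else (float(A[i][j]) if i > j else 0.0)
          for j in range(n)] for i in range(n)]
    # U = diagonal plus strictly upper triangular part of A
    U = [[float(A[i][j]) if i <= j else 0.0 for j in range(n)]
         for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            if L[i][k] != 0.0:
                L[i][k] /= U[k][k]            # multiplier l_ik
                for c in range(i, n):         # update row i of U
                    U[i][c] -= U[k][c] * L[i][k]
        for j in range(k + 1, n):
            if U[k][j] != 0.0:
                for r in range(j + 1, n):     # update column j of L
                    L[r][j] -= L[r][k] * U[k][j]
    return L, U
```

For example, calling `basic_sparse_lu([[4, 1, 0], [2, 3, 1], [0, 1, 2]])` returns factors whose product reproduces the input matrix up to rounding.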

The following theorem formulates the recursive column replication and the
replication of nonzeros along rows of $L$ using directed paths in $\mathcal{G}(U)$; an analogous
result holds for the rows of $U$ and directed paths in $\mathcal{G}(L^T)$.


Theorem 6.1 (Gilbert & Liu 1993) Assume that for some $k < j$ there is a directed
path $k \overset{\mathcal{G}(U)}{\Longrightarrow} j$. Then

$$\mathcal{S}\{L_{j:n,k}\} \subseteq \mathcal{S}\{L_{j:n,j}\}. \qquad (6.1)$$

Moreover, if $l_{ik} \neq 0$ for some $i > j$, then $l_{is} \neq 0$ for all vertices $s$ on this path.


The next two theorems generalize Theorem 4.3 to A being a general nonsymmetric 
matrix. 


Theorem 6.2 (Gilbert & Liu 1993) If $a_{ij} = 0$ and $i > j$, then there is a filled
entry $l_{ij} \neq 0$ if and only if there exists $k < j$ such that $a_{ik} \neq 0$ and there is a
directed path $k \overset{\mathcal{G}(U)}{\Longrightarrow} j$.


ALGORITHM 6.1 Basic sparse LU factorization

Input: Nonsymmetric and factorizable matrix A = L_A + D_A + U_A.
Output: LU factorization A = LU.

 1: L = I + L_A                ▷ Identity plus strictly lower triangular part of A
 2: U = D_A + U_A              ▷ Diagonal plus strictly upper triangular part of A
 3: for k = 1 : n − 1 do
 4:    for i ∈ {i > k | l_ik ≠ 0} do
 5:       l_ik = l_ik / u_kk
 6:       U_{i,i:n} = U_{i,i:n} − U_{k,i:n} l_ik          ▷ Update row i of U
 7:    end for
 8:    for j ∈ {j > k | u_kj ≠ 0} do
 9:       L_{j+1:n,j} = L_{j+1:n,j} − L_{j+1:n,k} u_kj    ▷ Update column j of L
10:    end for
11: end for


Theorem 6.3 (Gilbert & Liu 1993) If $a_{ij} = 0$ and $i < j$, then there is a filled
entry $u_{ij} \neq 0$ if and only if there exists $k < i$ such that $a_{kj} \neq 0$ and there is a
directed path $k \overset{\mathcal{G}(L^T)}{\Longrightarrow} i$.


Theorems 6.2 and 6.3 are demonstrated in Figure 6.2. Consider the directed path
$1 \to 3 \to 5 \to 6$ in $\mathcal{G}(U)$. Existence of this path implies the fill-in in $L$, first in
column 3, then in columns 5 and 6. Similarly, the directed path $2 \to 4 \to 5 \to 6$ in
$\mathcal{G}(L^T)$ implies fill-in at positions (4, 7), (5, 7) and (6, 7) in $U$.


Figure 6.2 The sparsity patterns of A (left) and L + U (right) together with the graphs
$\mathcal{G}(A)$ (left), $\mathcal{G}(L^T)$ (centre) and $\mathcal{G}(U)$ (right). The filled entries are denoted by f
and the corresponding edges are the red dashed lines.

Figure 6.3 Example to show the transitive reduction of a DAG. $\mathcal{G}$ is on the left, its transitive
reduction $\mathcal{G}^0$ is in the centre, and one possible $\mathcal{G}'$ that is equireachable with $\mathcal{G}$ is on the right.


6.1.2 Transitive Reduction and Equireachability 


To employ $\mathcal{G}(L^T)$ and $\mathcal{G}(U)$ in efficient algorithms, they need to be simplified. One
possibility is to use transitive reductions that are sparser and preserve reachability
within the graphs. A subgraph $\mathcal{G}^0 = (\mathcal{V}, \mathcal{E}^0)$ is a transitive reduction of
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ if the following conditions hold:


(T1) there is a path from vertex $i$ to vertex $j$ in $\mathcal{G}$ if and only if there is a path from
$i$ to $j$ in $\mathcal{G}^0$ (reachability condition), and

(T2) there is no subgraph with vertex set $\mathcal{V}$ that satisfies (T1) and has fewer edges
(minimality condition).


A transitive reduction is unique for a DAG, as shown in the following theorem and 
illustrated in Figure 6.3. 


Theorem 6.4 (Aho et al. 1972) Let $\mathcal{G}$ be a DAG. The transitive reduction $\mathcal{G}^0$ of $\mathcal{G}$
is unique and is the subgraph that has a path for every edge in $\mathcal{G}$ and has no proper
subgraph with this property.
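
To make conditions (T1) and (T2) concrete, the following Python sketch computes the transitive reduction of a DAG whose edges all run from a lower-numbered to a higher-numbered vertex, as is the case for $\mathcal{G}(U)$ and $\mathcal{G}(L^T)$. The function name and the edge-set representation are assumptions made for illustration; the quadratic-style approach is chosen for brevity, not efficiency.

```python
def transitive_reduction(n, edges):
    """Transitive reduction of a DAG on vertices 1..n.

    edges is a set of pairs (i, j) with i < j, so the natural vertex
    order is a topological order. An edge (i, j) is dropped when j can
    be reached from i through another successor of i.
    """
    succ = {v: set() for v in range(1, n + 1)}
    for i, j in edges:
        succ[i].add(j)
    # reach[v] = vertices reachable from v by a path with at least one edge,
    # built in reverse topological order
    reach = {}
    for v in range(n, 0, -1):
        r = set()
        for w in succ[v]:
            r.add(w)
            r |= reach[w]
        reach[v] = r
    reduced = set()
    for i, j in edges:
        if not any(j in reach[k] for k in succ[i] if k != j):
            reduced.add((i, j))
    return reduced
```

For the edge set {(1,2), (2,3), (1,3), (3,4), (1,4)} the edges (1,3) and (1,4) are redundant because of the path 1 → 2 → 3 → 4, so only the three path edges are kept.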


If S{A} is symmetric, then, as illustrated in Figure 6.4, the role of the transitive 
reduction is played by the elimination tree. 


Theorem 6.5 (Liu 1990; Eisenstat & Liu 2005a) If A is symmetrically structured,
then the transitive reduction of the DAG $\mathcal{G}(L^T)$ ($= \mathcal{G}(U)$) is the elimination tree
$\mathcal{T}(A)$.


Figure 6.4 The sparsity patterns of L + U of a symmetrically structured A together with the DAG
$\mathcal{G}(L^T)$ (left) and the elimination tree $\mathcal{T}(A)$ (right). The filled entries are denoted by f and the
corresponding edges are the red dashed lines. It is straightforward to see that $\mathcal{T}(A)$ is obtained as
the transitive reduction of $\mathcal{G}(L^T)$.


Obtaining the exact transitive reduction of a DAG can be expensive. Instead,
approximate reductions that drop the minimality condition may be computed. A
directed graph $\mathcal{G}'$ with the same vertex set as $\mathcal{G}$ that satisfies condition (T1) is said
to be equireachable with $\mathcal{G}$. The next result is a simplification of Theorem 6.1; an
analogous result holds for the sparsity patterns of the rows of $U$.


Theorem 6.6 (Gilbert & Liu 1993) Assume $\mathcal{G}'$ is equireachable with $\mathcal{G}(U)$ and
for some $k < j$ there is a directed path $k \overset{\mathcal{G}'}{\Longrightarrow} j$. Then (6.1) holds. Moreover, if
$l_{ik} \neq 0$ for some $i > j$, then $l_{is} \neq 0$ for all vertices $s$ on the directed path.


Equireachability enables sparse triangular linear systems to be solved more
efficiently. In Chapter 5, Theorem 5.2 describes how to obtain the sparsity pattern
$\mathcal{J}$ of the solution of a lower triangular system using paths in $\mathcal{G}(L^T)$. This graph
can be replaced by any graph that is equireachable with $\mathcal{G}(L^T)$. Equireachability
also allows Theorems 6.2 and 6.3 to be rewritten using paths in a graph $\mathcal{G}'$ that is
equireachable with $\mathcal{G}$.


Theorem 6.7 (Gilbert & Liu 1993) If $a_{ij} = 0$ and $i > j$, then there is a filled
entry $l_{ij} \neq 0$ if and only if there exists $k < j$ such that $a_{ik} \neq 0$ and a directed path
$k \overset{\mathcal{G}'(U)}{\Longrightarrow} j$, where $\mathcal{G}'(U)$ is equireachable with $\mathcal{G}(U)$.

Theorem 6.8 (Gilbert & Liu 1993) If $a_{ij} = 0$ and $i < j$, then there is a filled
entry $u_{ij} \neq 0$ if and only if there exists $k < i$ such that $a_{kj} \neq 0$ and a directed path
$k \overset{\mathcal{G}'(L^T)}{\Longrightarrow} i$, where $\mathcal{G}'(L^T)$ is equireachable with $\mathcal{G}(L^T)$.


Figure 6.5 depicts $\mathcal{G}(U)$ and $\mathcal{G}'(U)$ for the matrix in Figure 6.2.

Figure 6.5 The DAG $\mathcal{G}(U)$ for the matrix from Figure 6.2 (left) and $\mathcal{G}'(U)$ which is equireachable
with $\mathcal{G}(U)$ (right).

A description of the sparsity patterns of the columns of $L$ can be obtained from
the Schur complement (3.2) as follows:

$$\mathcal{S}\{L_{j:n,j}\} = \mathcal{S}\{A_{j:n,j}\} \;\cup \bigcup_{k<j,\; u_{kj} \neq 0} \mathcal{S}\{L_{j:n,k}\}.$$

Theorem 6.7 implies that not all the terms in this union are needed to obtain
$\mathcal{S}\{L_{j:n,j}\}$. This result is given in Theorem 6.9, which shows how $\mathcal{S}\{L\}$ can be
computed by columns if $\mathcal{G}'(U)$ that is equireachable with $\mathcal{G}(U)$ is known.


Theorem 6.9 (Gilbert & Liu 1993) If $\mathcal{G}'(U)$ is equireachable with $\mathcal{G}(U)$, then

$$\mathcal{S}\{L_{j:n,j}\} = \mathcal{S}\{A_{j:n,j}\} \;\cup \bigcup_{(k \to j) \in \mathcal{E}(\mathcal{G}'(U))} \mathcal{S}\{L_{j:n,k}\}, \qquad 1 \le j \le n. \qquad (6.2)$$

Proof Consider an edge $(k \to j)$ in $\mathcal{G}(U)$ but not in $\mathcal{G}'(U)$. Repeatedly applying
(6.1) along the directed path $k \overset{\mathcal{G}'(U)}{\Longrightarrow} j$, we see that $\mathcal{S}\{L_{j:n,k}\}$ is contained in the
right-hand side of (6.2) and therefore $\mathcal{S}\{L_{j:n,j}\}$ is contained in the right-hand side
of (6.2). Because the right-hand side of (6.2) is trivially contained in the left-hand
side, the result follows. □


An analogous result holds for the rows of U. 


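Theorem 6.10 (Gilbert & Liu 1993) If $\mathcal{G}'(L^T)$ is equireachable with $\mathcal{G}(L^T)$, then

$$\mathcal{S}\{U_{i,i:n}\} = \mathcal{S}\{A_{i,i:n}\} \;\cup \bigcup_{(k \to i) \in \mathcal{E}(\mathcal{G}'(L^T))} \mathcal{S}\{U_{k,i:n}\}, \qquad 1 \le i \le n.$$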


As an example of Theorem 6.9, consider the matrix in Figure 6.2. Because
$(3 \to 5)$ is the only edge of $\mathcal{G}'(U)$ in the union on the right-hand side of (6.2),
$\mathcal{S}\{L_{5:7,5}\}$ is given by

$$\mathcal{S}\{L_{5:7,5}\} = \mathcal{S}\{A_{5:7,5}\} \cup \mathcal{S}\{L_{5:7,3}\}.$$

We can see this from the graph $\mathcal{G}'(U)$ in Figure 6.5 (top right).


6.1.3 Symbolic LU Factorizations Using DAGs


Factorization by bordering can be used to obtain $\mathcal{S}\{L\}$ by rows and $\mathcal{S}\{U\}$ by
columns. Assume the sparsity patterns of the first $k-1$ rows of $L$ and the first
$k-1$ columns of $U$ ($1 < k \le n$) have been computed. At step $k$, the factors satisfy

$$A_{1:k,1:k} = \begin{pmatrix} A_{1:k-1,1:k-1} & A_{1:k-1,k} \\ A_{k,1:k-1} & a_{kk} \end{pmatrix}
= \begin{pmatrix} L_{1:k-1,1:k-1} & 0 \\ L_{k,1:k-1} & 1 \end{pmatrix}
\begin{pmatrix} U_{1:k-1,1:k-1} & U_{1:k-1,k} \\ 0 & u_{kk} \end{pmatrix}. \qquad (6.3)$$

Equating terms for the (2, 1) block, row $k$ of $L$ satisfies

$$L_{k,1:k-1}\, U_{1:k-1,1:k-1} = A_{k,1:k-1},$$

or, equivalently, if $y$ denotes the off-diagonal part of column $k$ of $L^T$, then it is
the solution of the lower triangular system

$$U_{1:k-1,1:k-1}^T\, y = A_{k,1:k-1}^T.$$

From Theorem 5.2, the sparsity pattern of $y$ is the set of all vertices reachable in the
DAG $\mathcal{G}(U_{1:k-1,1:k-1})$ (or in a graph that is equireachable with it) from the nonzeros
in $A_{k,1:k-1}$. Similarly, equating terms in (6.3) for the (1, 2) block, column $k$ of $U$
satisfies

$$L_{1:k-1,1:k-1}\, U_{1:k-1,k} = A_{1:k-1,k}.$$

Again, its sparsity pattern can be determined using Theorem 5.2 and the DAG
$\mathcal{G}(L_{1:k-1,1:k-1}^T)$. The diagonal entry $u_{kk}$ is then computed as $a_{kk} - L_{k,1:k-1} U_{1:k-1,k}$.
This shows that determining the sparsity patterns of $L$ and $U$ and computing
their numerical values are coupled: the computations of the two factors need to be
interleaved because computing part of one requires information from a part of the
other.
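
The symbolic part of the bordering scheme can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the pattern of A is given as a set of 1-based index pairs with a zero-free diagonal, no numerical cancellation occurs, and the names `reach` and `symbolic_lu_bordering` are hypothetical. At step k the pattern of row k of L is found by a reachability search in the graph of the already computed columns of U, and the pattern of column k of U by a search in the graph of the already computed rows of L.

```python
def reach(adj, starts):
    """Vertices reachable from `starts` (inclusive) along edges in `adj`."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def symbolic_lu_bordering(pattern, n):
    """Patterns of the strictly lower part of L and strictly upper part of U.

    pattern: set of (i, j), 1-based positions of the nonzeros of A.
    """
    g_u = {}    # G(U):   edge i -> j when u_ij != 0 (i < j)
    g_lt = {}   # G(L^T): edge j -> i when l_ij != 0 (i > j)
    SL, SU = set(), set()
    for k in range(1, n + 1):
        row_starts = [j for j in range(1, k) if (k, j) in pattern]
        col_starts = [i for i in range(1, k) if (i, k) in pattern]
        row_k = reach(g_u, row_starts)    # pattern of L[k, 1:k-1]
        col_k = reach(g_lt, col_starts)   # pattern of U[1:k-1, k]
        for j in row_k:
            SL.add((k, j))
            g_lt.setdefault(j, set()).add(k)
        for i in col_k:
            SU.add((i, k))
            g_u.setdefault(i, set()).add(k)
    return SL, SU
```

Both searches at step k use only edges created in earlier steps, which reflects the interleaving of the two factors described above.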


6.1.4 Graph Pruning 


Consider the matrices in Figure 6.6. The one in the centre is the same as the one
on the left except that the entries in positions (4, 6) and (6, 4) have been removed
(that is, pruned). Both matrices have the same sets of reachable vertices in $\mathcal{G}(L^T)$
and $\mathcal{G}(U)$. This suggests how to find $\mathcal{G}'(L^T)$ and $\mathcal{G}'(U)$ that are equireachable with
$\mathcal{G}(L^T)$ and $\mathcal{G}(U)$, respectively.


Theorem 6.11 (Eisenstat & Liu 1992) If for some $j < s$ both $l_{sj} \neq 0$ and $u_{js} \neq 0$,
then there are no edges $(j \to k)$ with $k > s$ in the transitive reductions of $\mathcal{G}(U)$
and $\mathcal{G}(L^T)$.

Proof Let $(j \to k)$ be an edge of $\mathcal{G}(U)$, that is, $u_{jk} \neq 0$. Because $l_{sj} \neq 0$ and
$u_{jk} \neq 0$ implies that $u_{sk} \neq 0$, there is a path $j \to s \to k$ in $\mathcal{G}(U)$ and the edge
$(j \to k)$ does not belong to the transitive reduction of $\mathcal{G}(U)$. The result for $\mathcal{G}(L^T)$
can be seen analogously. □


Figure 6.6 An example of symmetric pruning. On the left is $\mathcal{S}\{L + U\}$. In the centre is the reduced
sparsity pattern obtained by symmetric pruning. On the right is the reduced sparsity pattern that
results from symmetric path pruning.


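This theorem implies that if for some $s > j$ there are edges

$$j \overset{\mathcal{G}(L^T)}{\longrightarrow} s \quad \text{and} \quad j \overset{\mathcal{G}(U)}{\longrightarrow} s,$$

then all edges $(j \to k)$ in $\mathcal{G}(U)$ and $\mathcal{G}(L^T)$ with $k > s$ can be pruned. The resulting
DAGs $\mathcal{G}'(U)$ and $\mathcal{G}'(L^T)$ have fewer edges and are equireachable with $\mathcal{G}(U)$ and
$\mathcal{G}(L^T)$, respectively. The removal of redundant edges based on Theorem 6.11 is
called symmetric pruning.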
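
Symmetric pruning can be sketched in Python as follows. The sketch assumes the patterns of the strictly lower triangular part of L and the strictly upper triangular part of U are already known as sets of 1-based index pairs and that they include all fill-in; the function name `symmetric_pruning` and the adjacency-list output are hypothetical choices.

```python
def symmetric_pruning(SL, SU, n):
    """Prune G(U) and G(L^T) using Theorem 6.11.

    SL: positions (i, j), i > j, with l_ij != 0.
    SU: positions (i, j), i < j, with u_ij != 0.
    Returns adjacency lists of the pruned graphs G'(U) and G'(L^T).
    """
    # G(U): j -> k when u_jk != 0;  G(L^T): j -> k when l_kj != 0
    g_u = {j: sorted(c for (r, c) in SU if r == j) for j in range(1, n + 1)}
    g_lt = {j: sorted(r for (r, c) in SL if c == j) for j in range(1, n + 1)}
    for j in range(1, n + 1):
        common = set(g_u[j]) & set(g_lt[j])
        if common:
            s = min(common)   # smallest s with l_sj != 0 and u_js != 0
            g_u[j] = [k for k in g_u[j] if k <= s]
            g_lt[j] = [k for k in g_lt[j] if k <= s]
    return g_u, g_lt
```

For a fully dense 3 × 3 pattern the edge (1 → 3) is removed from both graphs because the edges (1 → 2) exist in both, and vertex 3 is still reachable from vertex 1 through vertex 2.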

There are other ways to perform pruning. For example, if for some $s > j$ there
are paths

$$j \overset{\mathcal{G}(L^T)}{\Longrightarrow} s \quad \text{and} \quad j \overset{\mathcal{G}(U)}{\Longrightarrow} s,$$

then for all $k > s$ symmetric path pruning removes the edges $(j \to k)$ from
$\mathcal{G}(U)$ and $\mathcal{G}(L^T)$. Consider again Figure 6.6. In the centre is the sparsity pattern
after symmetric pruning and on the right is the reduced sparsity pattern that results
from symmetric path pruning. The edge $(1 \to 6)$ is not required in $\mathcal{G}'(L^T)$ or $\mathcal{G}'(U)$
because there are paths

$$1 \overset{\mathcal{G}(L^T)}{\longrightarrow} 2 \overset{\mathcal{G}(L^T)}{\longrightarrow} 4 \overset{\mathcal{G}(L^T)}{\longrightarrow} 5 \overset{\mathcal{G}(L^T)}{\longrightarrow} 6
\quad \text{and} \quad
1 \overset{\mathcal{G}(U)}{\longrightarrow} 3 \overset{\mathcal{G}(U)}{\longrightarrow} 6.$$


Figure 6.7 An example of the sparsity pattern of a nonsymmetric matrix A (left), $\mathcal{S}\{L + U\}$ with
filled entries denoted by f (right) and its elimination tree.


6.1.5 Elimination Trees for Nonsymmetric Matrices 


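The elimination DAGs $\mathcal{G}(L)$ and $\mathcal{G}(U)$ can be combined into a single structure
called the nonsymmetric elimination tree in which edges are replaced by paths.
This can be advantageous because it is more compact. From (4.3), if $\mathcal{S}\{A\}$ is
symmetric, then its elimination tree is defined in terms of the mapping

$$parent(j) = \min\{\, i \mid i > j \text{ and } l_{ij} \neq 0 \,\}.$$

The condition $l_{ij} \neq 0$ is equivalent to $i \overset{\mathcal{G}(L)}{\longrightarrow} j \overset{\mathcal{G}(L^T)}{\longrightarrow} i$. In the nonsymmetric case,
the definition can be generalized using directed paths

$$parent(j) = \min\{\, i \mid i > j \text{ and } i \overset{\mathcal{G}(L)}{\Longrightarrow} j \overset{\mathcal{G}(U)}{\Longrightarrow} i \,\}. \qquad (6.4)$$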


This is illustrated in Figure 6.7. Vertices 6, 8, and 10 are the only ones with cycles
of the form

$$i \overset{\mathcal{G}(L)}{\Longrightarrow} 2 \overset{\mathcal{G}(U)}{\Longrightarrow} i,$$

and so, in this example, $parent(2) = 6$.


ALGORITHM 6.2 Basic computation of the elimination tree for nonsymmetric A

Input: Digraph G(A).
Output: The elimination tree given by the mapping parent.

 1: parent(1 : n) = 0
 2: for i = 1 : n do
 3:    Find the vertex set V_C of the strong component of G(A_{1:i,1:i}) that contains i
 4:    for j ∈ V_C \ {i} do
 5:       if parent(j) = 0 then
 6:          parent(j) = i
 7:       end if
 8:    end for
 9:    parent(i) = 0
10: end for

Theorem 6.12, which can be regarded as a generalization of Corollary 4.6, shows 
how the elimination tree for nonsymmetric A can be constructed. 


Theorem 6.12 (Eisenstat & Liu 2005a) Let A be a nonsymmetric matrix. Then
$i = parent(j)$ if and only if $i > j$ and $i$ is the smallest vertex that belongs to the same
strong component of $\mathcal{G}(A_{1:i,1:i})$ as vertex $j$.


This result is employed in Algorithm 6.2. The complexity of finding the strong
components of a digraph with $m$ edges and $n$ vertices is $O(n + m)$. Hence, the
complexity of Algorithm 6.2 is $O(nz(A)\,n)$. More sophisticated approaches with
complexity $O(nz(A) \log n)$ exist.

To illustrate Algorithm 6.2, consider the matrix and its elimination tree depicted
in Figure 6.7. The main loop sets the first nonzero value in the array parent when
$i = 3$ because this is the first $i$ for which the set $V_C \setminus \{i\}$ is nonempty; it is equal to
$\{1\}$ and thus $parent(1) = i = 3$. For $i = 4$, the vertex set $\{1, 3, 4\}$ forms a strong
component of $\mathcal{G}(A_{1:4,1:4})$ and so $parent(3) = 4$. For $i = 5$, the single vertex $\{5\}$
is a strong component of $\mathcal{G}(A_{1:5,1:5})$ and, therefore, 5 is not a parent of any other
vertex (it is a leaf vertex). $\mathcal{G}(A_{1:6,1:6})$ has two strong components with vertex sets
$\{1, 3, 4\}$ and $\{2, 5, 6\}$. Vertex $i = 6$ belongs to the second of these and thus the algorithm
sets $parent(j) = i = 6$ for $j = 2$ and 5.
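
A direct, unoptimized transcription of Algorithm 6.2 into Python is sketched below. It assumes the pattern of A is given as a set of 1-based index pairs and finds the strong component containing vertex i by intersecting forward and backward reachability sets, which is simple but slower than a dedicated strong component algorithm. The helper names `reachable` and `nonsymmetric_etree` are hypothetical.

```python
def reachable(pattern, start, limit, forward=True):
    """Vertices <= limit reachable from `start` in G(A), or in its reverse."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for r, c in pattern:
            src, dst = (r, c) if forward else (c, r)
            if src == v and dst <= limit and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def nonsymmetric_etree(pattern, n):
    """Return [parent(1), ..., parent(n)], with 0 marking a root."""
    parent = [0] * (n + 1)
    for i in range(1, n + 1):
        # strong component of G(A_{1:i,1:i}) containing i
        comp = reachable(pattern, i, i, True) & reachable(pattern, i, i, False)
        for j in comp - {i}:
            if parent[j] == 0:
                parent[j] = i
        parent[i] = 0
    return parent[1:]
```

When the pattern is symmetric, the result coincides with the usual elimination tree; for a tridiagonal pattern of order 3, for instance, the sketch gives parent(1) = 2 and parent(2) = 3.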

An attractive idea for constructing $\mathcal{S}\{L + U\}$ and subsequently computing the
LU factorization is based on using the column elimination tree $\mathcal{T}(A^T A)$.


Theorem 6.13 (George & Ng 1985; Grigori et al. 2009) Assume all the diagonal
entries of A are nonzero and let $\widehat{L}\widehat{L}^T$ be the Cholesky factorization of $A^T A$. Then
for any row permutation matrix $P$ such that $PA = LU$ the following holds:

$$\mathcal{S}\{L + U\} \subseteq \mathcal{S}\{\widehat{L} + \widehat{L}^T\}.$$


Figure 6.8 The sparsity patterns of $A$ and $L + U$ (top) and of $A^T A$ and $\widehat{L} + \widehat{L}^T$, where
$A^T A = \widehat{L}\widehat{L}^T$ (bottom). Filled entries are denoted by f. The corresponding elimination trees
are also given.


An important feature of Theorem 6.13 is that it holds for any row permutation matrix
$P$ applied to $A$. This allows partial pivoting (Section 3.1.2) to be used. The following
result states that $\mathcal{T}(A^T A)$ represents the potential dependencies among the columns
in an LU factorization and that for strong Hall matrices no tighter prediction is
possible from the sparsity structure of $A$.


Theorem 6.14 (Gilbert & Ng 1993) If $PA = LU$ is any factorization of $A$ with
partial pivoting, then the following hold.


1. If vertex $i$ is an ancestor of vertex $j$ in $\mathcal{T}(A^T A)$, then $i > j$.

2. If $l_{ij} \neq 0$, $i \neq j$, then vertex $i$ is an ancestor of vertex $j$ in $\mathcal{T}(A^T A)$.

3. If $u_{ij} \neq 0$, $i \neq j$, then vertex $j$ is an ancestor of vertex $i$ in $\mathcal{T}(A^T A)$.

4. Suppose in addition that $A$ is a strong Hall matrix. If $l = parent(k)$ in $\mathcal{T}(A^T A)$,
then there are values of the nonzero entries of $A$ for which $u_{kl} \neq 0$.


Figure 6.8 illustrates the differences in the sparsity patterns of $A$ and $A^T A$ and
of their factors; the corresponding elimination trees are also given. This reveals a
potential problem with the column elimination tree: $\mathcal{S}\{A^T A\}$ can have significantly
more entries than $\mathcal{S}\{L + U\}$. An extreme example is when $A$ has one or more dense
rows because $A^T A$ is then fully dense.
6.1.6 Supernodes in LU Factorizations


Supernodes group together columns of the factors with the same nonzero structure, 
allowing them to be treated as a dense submatrix for storage and computation. 
When solving SPD systems, supernodes can be determined during the symbolic 
phase. For nonsymmetric matrices, supernodes are harder to characterize. The need 
to incorporate pivoting means it may not be possible to predict the sparsity structures 
of the factors before the numerical factorization and they must be identified on-the- 
fly. While there are several possible ways to define supernodes, the simplest (which 
is widely used in practice) follows the symmetric case and defines a supernode to 
be a set of contiguously numbered columns of L with the triangular diagonal block 
treated as dense and the columns as having the same structure below the diagonal 
block. 
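
The simple supernode definition above can be sketched in Python as follows. The sketch assumes the column patterns of L are already known (which, as noted, may only be the case during the numerical factorization) and are supplied as lists of 0-based row indices that include the diagonal; the function name `find_supernodes` is hypothetical.

```python
def find_supernodes(col_patterns):
    """Group contiguous columns of L into supernodes.

    col_patterns[j] holds the row indices of the nonzeros in column j of L,
    including the diagonal entry j (0-based). Column j joins the supernode
    of column j-1 when its pattern equals that of column j-1 with row j-1
    removed, so the diagonal block is dense and the rows below it agree.
    Returns a list of (first_column, last_column) pairs.
    """
    n = len(col_patterns)
    supernodes = []
    start = 0
    for j in range(1, n + 1):
        if j == n or set(col_patterns[j]) != set(col_patterns[j - 1]) - {j - 1}:
            supernodes.append((start, j - 1))
            start = j
    return supernodes
```

For the column patterns [0, 1, 3], [1, 3], [2, 3], [3] the sketch groups columns 0 and 1 into one supernode and columns 2 and 3 into another.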

In a Cholesky solver, fundamental supernodes (Section 4.6.1) are made contiguous
by symmetrically permuting the matrix according to a postordering of its
elimination tree; this does not change the sparsity of the Cholesky factor. For
nonsymmetric $A$, before the numerical factorization, $\mathcal{T}(A^T A)$ can be constructed
and the columns of $A$ then permuted according to its postordering to bring together
supernodes. The following result extends Theorem 4.9.


Theorem 6.15 (Li 1996) Let $A$ have column elimination tree $\mathcal{T}(A^T A)$. Let $p$ be
a permutation vector such that if $p_i$ is an ancestor of $p_j$ in $\mathcal{T}(A^T A)$, then $i > j$.
Let $P$ be the permutation matrix corresponding to $p$ and let $\bar{A} = P A P^T$. Then
$\mathcal{T}(\bar{A}^T \bar{A})$ is isomorphic to $\mathcal{T}(A^T A)$; in particular, relabelling each vertex $i$ of
$\mathcal{T}(\bar{A}^T \bar{A})$ as $p_i$ yields $\mathcal{T}(A^T A)$. If, in addition, $A = LU$ is an LU factorization
without pivoting then $P L P^T$ and $P U P^T$ are lower triangular and upper triangular
matrices, respectively, so that $\bar{A} = (P L P^T)(P U P^T)$ is also an LU factorization.


In practice, for many matrices the average size of a supernode is only 2 or 3
columns and many comprise a single column. Larger artificial supernodes may be
created by merging vertex $j$ with its parent vertex $i$ in $\mathcal{T}(A^T A)$ if the subtree rooted
at $i$ has fewer than some chosen number of vertices.


6.2 LU Multifrontal Method 


The multifrontal method (Section 5.4) can be generalized to nonsymmetric A 
by modifying the definitions of the frontal matrices and generated elements to 
conform to an LU factorization. But natural generalizations to rectangular frontal 
and generated element matrices do not simultaneously satisfy the statements from 
Observation 5.1. These statements can be rewritten for the LU factorization as 
follows. 


(a) Each generated element $V_j$ is used only once to contribute to a frontal matrix.
(b) The row and column index lists for the rectangular frontal matrix $F_j$ correspond
to the nonzeros in column $L_{j:n,j}$ and nonzeros in row $U_{j,j:n}$, respectively.


These conditions cannot both hold. An approach that satisfies (a) can be based on
the sparsity pattern $\mathcal{S}\{A + A^T\}$ and storing some explicit zeros if $\mathcal{S}\{A\}$ is not
symmetric. It is then analogous to the symmetric multifrontal method. In this case,
although the frontal and generated elements may be numerically nonsymmetric,
they are square and structurally symmetric. This approach performs well if $\mathcal{S}\{A\}$
is close to symmetric, that is, the symmetry index of $A$ is close to unity.

An approach that satisfies (b) and not necessarily (a) splits the generated elements
into smaller ones that are embedded into further rectangular frontal matrices. We
illustrate this using the 10 × 10 example from Figure 6.7, in which ∗ denotes an entry
of $A$ and f denotes a filled entry in $L + U$. Taking the entries in the first row and
column gives the sparsity patterns of the first frontal matrix $F_1$ and the
corresponding generated element $V_1$.

[Display: sparsity patterns of $F_1$ and $V_1$.]

To construct $F_2$ that satisfies (b) we can only use part of $V_1$. From the row and
column replication principles, because $a_{13} \neq 0$, the sparsity pattern of column 1
is replicated in that of column 3 of the factors. While the entry in position (2, 3)
belongs to $F_2$, because of the row replication of the sparsity pattern of the first row
in that of the second row, the remaining entries contribute to $F_3$ and so we split $V_1$
into two as follows

$$V_1 = V_1^2 \oplus V_1^3,$$

where $\oplus$ is the extend-add operator and $V_1^2$ and $V_1^3$ contribute to $F_2$ and $F_3$,
respectively. These are combined with the entries of row 2 and column 2 of $A$ to give
$F_2$ and the corresponding generated element $V_2$.

[Display: sparsity patterns of $F_2$, formed using $V_1^2$, and of $V_2$.]

$V_2$ is split in the same way, and its part $V_2^3$ contributes, together with $V_1^3$, to the
next frontal matrix $F_3$, which has the generated element $V_3$.

[Display: splitting of $V_2$ and sparsity patterns of $F_3$, formed using $V_1^3$ and $V_2^3$, and of $V_3$.]

The subsequent steps can be described in a similar way.
Theorem 6.16 expresses the nested relationship between the nonsymmetric 
multifrontal method and the nonsymmetric elimination tree. 


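Theorem 6.16 (Eisenstat & Liu 2005b) Assume $A$ is a general nonsymmetric
matrix and $t = parent(k)$ in $\mathcal{T}(A)$. Then

$$\mathcal{S}\{L_{t:n,k}\} \subseteq \mathcal{S}\{L_{t:n,t}\} \quad \text{and} \quad \mathcal{S}\{U_{k,t:n}\} \subseteq \mathcal{S}\{U_{t,t:n}\}.$$

Proof Because $t$ is the parent of $k$, by definition $t \overset{\mathcal{G}(L)}{\Longrightarrow} k \overset{\mathcal{G}(U)}{\Longrightarrow} t$. If $u_{ij} \neq 0$,
then a multiple of column $i$ is added to column $j$ during the LU factorization.
Thus, by a simple induction argument, for each $j$ on the path $k \overset{\mathcal{G}(U)}{\Longrightarrow} t$, we must
have $\mathcal{S}\{L_{j:n,k}\} \subseteq \mathcal{S}\{L_{j:n,j}\}$. In particular, this holds for column $t$. The second part
follows by a similar argument using the path $t \overset{\mathcal{G}(L)}{\Longrightarrow} k$. □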


This result shows that the parent relationship in the nonsymmetric elimination
tree guarantees that both row and column replications can be applied at the same
time. Hence all entries of the submatrices of the generated element $V_k$ with indices
greater than or equal to $parent(k)$ can be added to $V_{parent(k)}$ using the extend-add
operation $\oplus$. To illustrate this, consider again the 10 × 10 example above for which
$parent(1) = 3$. Theorem 6.16 guarantees that $V_1$ can be embedded into $F_3$ because
$\mathcal{S}\{L_{3:n,1}\} \subseteq \mathcal{S}\{L_{3:n,3}\}$ and $\mathcal{S}\{U_{1,3:n}\} \subseteq \mathcal{S}\{U_{3,3:n}\}$.
6.3 Preprocessing Sparse Matrices


We now turn our attention to preprocessing techniques that can help in computing an 
LU factorization. In particular, we consider when A does not have a full transversal 
(that is, it has one or more zeros on the diagonal). For numerical stability and to 
reduce the number of permutations required during the factorization, it can be useful 
to permute A before the factorization begins to put large nonzero entries on the 
diagonal. As an example, consider the matrix A in Figure 6.9. It has a22 = 0 and 
we want to know whether it can be permuted so that all the diagonal entries are 
nonzero. This question and its answer can be formulated in terms of matchings and 
bipartite graphs. 


6.3.1 Bipartite Graphs and Matchings 


Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, an edge subset $\mathcal{M} \subseteq \mathcal{E}$ is called a matching (or
assignment) if no two edges in $\mathcal{M}$ are incident to the same vertex. In matrix terms, a
matching corresponds to a set of nonzero entries with no two belonging to the same 
row or column. A vertex is matched if there is an edge in the matching incident 
on the vertex, and is unmatched (or free) otherwise. The cardinality of a matching 
is the number of edges in it. A maximum cardinality matching (or maximum 
matching) is a matching of maximum cardinality. A matching is perfect if all the 
vertices are matched. 

A bipartite graph is an undirected graph whose vertices can be partitioned into
two disjoint sets such that no two vertices within the same set are adjacent, that is,
each set is an independent set. Let the $n \times n$ matrix $A$ have entries $\{a_{ij}\}$. Associated
with $A$ is a bipartite graph defined as a triple $\mathcal{G}_b = (\mathcal{V}_{row}, \mathcal{V}_{col}, \mathcal{E})$, where the row
vertex set $\mathcal{V}_{row} = \{i \mid a_{ij} \neq 0\}$ and the column vertex set $\mathcal{V}_{col} = \{j' \mid a_{ij} \neq 0\}$
correspond to the rows and columns of $A$ and there is an (undirected) edge $(i, j') \in \mathcal{E}$
if and only if $a_{ij} \neq 0$. This is illustrated in Figure 6.9. We use prime to distinguish
between the independent set of row vertices and the independent set of column
vertices, that is, $i$ denotes a row vertex and $i'$ denotes a column vertex.

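If $A$ is structurally nonsingular, a matching $\mathcal{M}$ in $\mathcal{G}_b$ is perfect if it has cardinality
$n$. A perfect matching defines an $n \times n$ permutation matrix $Q$ with entries $q_{ij}$ given
by

$$q_{ij} = \begin{cases} 1, & \text{if } (j, i') \in \mathcal{M}, \\ 0, & \text{otherwise.} \end{cases}$$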


Both $QA$ and $AQ$ have the matching entries on the (zero-free) diagonal. $Q$ and the
column permuted matrix $AQ$ for the example in Figure 6.9 are given in Figure 6.10.


Figure 6.9 A sparse matrix and its bipartite graph $\mathcal{G}_b$ (left). The matched matrix entries are in
blue and edges that belong to a perfect matching in $\mathcal{G}_b$ are given by the blue dashed lines (right).
Note that the perfect matching is not unique (an alternative is in Figure 6.11).

Figure 6.10 The permutation matrix $Q$ and the column permuted matrix $AQ$ corresponding to
the matrix in Figure 6.9. The matched entries are on the diagonal of $AQ$.
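
The construction of $Q$ from a perfect matching can be checked with a short Python sketch. It assumes 0-based indices, dense list-of-lists storage, and a matching supplied as a dictionary that maps each row j to its matched column i; the names `permutation_from_matching` and `matmul` are hypothetical helpers.

```python
def permutation_from_matching(match, n):
    """Build Q with q_ij = 1 when row j is matched to column i.

    match[j] = i encodes the matching edge (j, i') using 0-based indices.
    """
    Q = [[0] * n for _ in range(n)]
    for j, i in match.items():
        Q[i][j] = 1
    return Q


def matmul(X, Y):
    """Plain dense matrix product, used only to check the permutation."""
    n = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


A = [[0, 1, 0],
     [0, 0, 2],
     [3, 0, 0]]
match = {0: 1, 1: 2, 2: 0}   # row 0 -> column 1, row 1 -> column 2, row 2 -> column 0
Q = permutation_from_matching(match, 3)
AQ = matmul(A, Q)
QA = matmul(Q, A)
```

In this small example every diagonal entry of A is zero, while both AQ and QA carry the matched entries 1, 2 and 3 on their diagonals (in different orders).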


6.3.2 Augmenting Paths 


If a perfect matching exists, it can be found using augmenting paths. A path $\mathcal{P}$ in
a graph is an ordered set of edges in which successive edges are incident to the
same vertex. $\mathcal{P}$ is called an $\mathcal{M}$-alternating path if the edges of $\mathcal{P}$ are alternately
in $\mathcal{M}$ and not in $\mathcal{M}$. An $\mathcal{M}$-alternating path is an $\mathcal{M}$-augmenting path in $\mathcal{G}_b$ if it
connects an unmatched column vertex with an unmatched row vertex. Note that the
length of an $\mathcal{M}$-augmenting path is an odd integer.


ALGORITHM 6.3 Maximum matching algorithm

Input: An undirected graph.
Output: A maximum matching M.

1: Find an initial matching M         ▷ For example, M = ∅
2: while there exists an M-augmenting path P do
3:    M = M ⊕ P                       ▷ Augment the matching along P
4: end while


Figure 6.11 An illustration of the search for a perfect matching using augmenting paths. On the 
left, the dashed lines represent a matching with cardinality 5. In the centre, the blue line is an 
augmenting path with end vertices 2 and 2’. On the right is the perfect matching with cardinality 6 
that is obtained using the augmenting path. 


Let M and P be subsets of E and define the symmetric difference

\[ M \oplus P := (M \setminus P) \cup (P \setminus M), \]

that is, the set of edges that belongs to either M or P but not to both. If M is
a matching and P is an M-augmenting path, then $M \oplus P$ is a matching with
cardinality $|M| + 1$. Growing the matching in this way is called augmenting along
P. The next result shows that augmenting paths can be used to find a maximum
matching (Algorithm 6.3).


Theorem 6.17 (Berge 1957) A matching M in an undirected graph is a maximum
matching if and only if there is no M-augmenting path.


Figure 6.11 demonstrates the procedure. On the left is a bipartite graph with a
matching with cardinality 5. In the centre, an augmenting path
$2 \Rightarrow 3' \Rightarrow 3 \Rightarrow 4' \Rightarrow 4 \Rightarrow 2'$ is shown. Augmenting the matching along this path, the cardinality
of the matching increases to 6 and $M \oplus P$ is a perfect matching.
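
The augmenting path idea of Algorithm 6.3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in any particular solver: the function name `maximum_matching` and the input layout `rows_of_col` (for each column, the list of rows holding a nonzero) are assumptions made for this example. Each unmatched column starts a depth-first search for an M-augmenting path, and the matching is flipped along the path as the recursion unwinds.

```python
def maximum_matching(rows_of_col, n_rows):
    """Return match_row, where match_row[i] is the column matched to row i (or None).

    rows_of_col[j] lists the rows i with a nonzero entry in column j
    (hypothetical input layout chosen for this sketch).
    """
    match_row = [None] * n_rows              # row i -> matched column
    match_col = [None] * len(rows_of_col)    # column j -> matched row

    def augment(j, visited):
        # Depth-first search for an M-augmenting path starting at column j.
        for i in rows_of_col[j]:
            if i in visited:
                continue
            visited.add(i)
            # Either row i is unmatched, or the column currently matched
            # to row i can be re-matched to some other row.
            if match_row[i] is None or augment(match_row[i], visited):
                match_row[i] = j
                match_col[j] = i
                return True
        return False

    for j in range(len(rows_of_col)):
        if match_col[j] is None:
            augment(j, set())
    return match_row


# A 4 x 4 pattern that is structurally nonsingular.
pattern = [[0, 1], [0], [1, 2], [2, 3]]
match = maximum_matching(pattern, 4)

# Two columns competing for a single row: no perfect matching exists.
deficient = maximum_matching([[0], [0]], 2)
```

In the first example every row ends up matched, so the matching defines a column permutation that places nonzeros on the diagonal. In the second example only one of the two columns can be matched.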
6.3.3 Weighted Matchings


While the maximum matching algorithm finds a permutation of A such that 
the permuted matrix has nonzero diagonal entries, there are more sophisticated 
variations that aim to ensure the absolute values of the diagonal entries of the 
permuted matrix (or their product) are in some sense large. This can increase the 
likelihood that the permuted matrix is strongly regular and reduce the need for 
partial pivoting during the LU factorization. The core problem is as follows: given 
an n × n matrix A, find a matching of the rows to the columns such that the
product of the matched entries is maximized. That is, find a permutation vector
q that maximizes

\[ \prod_{i=1}^{n} |a_{i q_i}|. \tag{6.5} \]


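Define a matrix C corresponding to A with entries $c_{ij} \ge 0$ as follows:

\[ c_{ij} = \begin{cases} \log(\max_i |a_{ij}|) - \log |a_{ij}|, & \text{if } a_{ij} \neq 0, \\ \infty, & \text{otherwise.} \end{cases} \tag{6.6} \]

It is straightforward to see that finding a q that solves (6.5) is equivalent to finding
a q that minimizes

\[ \sum_{i=1}^{n} c_{i q_i}, \tag{6.7} \]

(taking logarithms turns the product in (6.5) into a sum, and the column maxima
contribute a constant that does not depend on q),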


which is equivalent to finding a minimum weight perfect matching in an edge 
weighted bipartite graph. This is a well-studied problem and is known as the 
bipartite weighted matching or linear sum assignment problem. 
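
The equivalence between maximizing the product (6.5) and minimizing the sum (6.7) can be checked by brute force on a tiny example. The matrix `A` below is made up for illustration, and enumerating all permutations is only feasible for very small n; it is not how the assignment problem is solved in practice.

```python
import itertools
import math

# Small illustrative matrix (values chosen arbitrarily for this sketch).
A = [[4.0, 1.0, 0.0],
     [2.0, 3.0, 1.0],
     [0.0, 5.0, 2.0]]
n = len(A)

# Weights (6.6): c_ij = log(max_i |a_ij|) - log|a_ij| for nonzeros, infinity otherwise.
col_max = [max(abs(A[i][j]) for i in range(n)) for j in range(n)]
C = [[math.log(col_max[j]) - math.log(abs(A[i][j])) if A[i][j] != 0 else math.inf
      for j in range(n)] for i in range(n)]


def product(q):
    """Objective (6.5): product of the matched entries |a_{i,q_i}|."""
    return math.prod(abs(A[i][q[i]]) for i in range(n))


def cost(q):
    """Objective (6.7): sum of the weights c_{i,q_i}."""
    return sum(C[i][q[i]] for i in range(n))


perms = list(itertools.permutations(range(n)))
q_max_product = max(perms, key=product)
q_min_cost = min(perms, key=cost)
```

For this matrix both objectives select the identity permutation, whose matched entries 4, 3 and 2 have product 24.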

If $G_b = (V_{row}, V_{col}, E)$ is the bipartite graph associated with A then let
$G_b(C) = (V_{row}, V_{col}, E)$ be the corresponding weighted bipartite graph in which each edge
$(i, j') \in E$ has a weight $c_{ij} \ge 0$. The weight (or cost) of a matching M in $G_b(C)$,
denoted by csum(M), is the sum of its edge weights; i.e.

\[ csum(M) = \sum_{(i, j') \in M} c_{ij}. \]

A perfect matching M in $G_b(C)$ is said to be a minimum weight perfect matching
if it has smallest possible weight, i.e. $csum(M) \le csum(\widetilde{M})$ for all possible perfect
matchings $\widetilde{M}$.

The key concept for finding a minimum weight perfect matching is that of a
shortest augmenting path. An M-augmenting path P starting at an unmatched
column vertex is called shortest if

\[ csum(M \oplus P) \le csum(M \oplus \widetilde{P}) \]

for all other possible M-augmenting paths $\widetilde{P}$ starting at the same column vertex.
A matching $M_e$ is extreme if and only if there exist $u_i$ and $v_j$ (which are termed
dual variables) satisfying

\[ c_{ij} = u_i + v_j \ \text{ if } (i, j') \in M_e, \qquad c_{ij} \ge u_i + v_j \ \text{ otherwise.} \tag{6.8} \]


This is employed by the MC64 algorithm. The dual variables will be important 
when we discuss scaling sparse matrices in Section 7.4.2. The MC64 algorithm is 
outlined here as Algorithm 6.4. It starts with a feasible solution and corresponding 
extreme matching and then proceeds to iteratively increase its cardinality by one 
by constructing a sequence of shortest augmenting paths until a perfect extreme 
matching is found. The algorithm can be made more efficient if a large initial 
extreme matching can be found. For example, Step 3 can be replaced by setting
$u_i = \min\{c_{ij} \mid j \in S\{A_{i,1:n}\}\}$ for $i \in V_{row}$ and
$v_j = \min\{c_{ij} - u_i \mid i \in S\{A_{1:n,j}\}\}$ for $j' \in V_{col}$. In Step 4, an initial
extreme matching can be determined from the edges for which $c_{ij} - u_i - v_j = 0$.

There are a number of potential problems with the MC64 algorithm. First, the 
runtime is hard to predict and depends on the initial ordering of A. Second, it 
is a serial algorithm and as such it can represent a significant fraction of the 
total factorization time of a direct solver. Because the complexity of Step 6 of
Algorithm 6.4 is $O((n + nz(A)) \log n)$ and the complexity of Step 7 is $O(n)$ and of
Step 8 is $O(n + nz(A))$, MC64 has a worst-case complexity of $O(n(n + nz(A)) \log n)$.
In practice, this bound is not achieved and the algorithm is widely used.


ALGORITHM 6.4 Outline of the MC64 algorithm
Input: Matrix A.
Output: A matching M and dual variables $u_i$, $v_j$.

1: Define the weights $c_{ij}$ using (6.6)
2: Construct the weighted bipartite graph $G_b(C) = (V_{row}, V_{col}, E)$
3: Set $u_i = 0$ for $i \in V_{row}$ and $v_j = \min\{c_{ij} : (i, j') \in E\}$ for $j' \in V_{col}$     ▷ Initial solution
4: Set $M = \{(i, j') \mid u_i + v_j = c_{ij}\}$                                  ▷ Initial extreme matching
5: while M is not perfect do
6:     Find the shortest augmenting path P with respect to M
7:     Augment the matching M = M ⊕ P
8:     Update $u_i$, $v_j$ so that (6.8) is satisfied for new M            ▷ Make M extreme
9: end while
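
The interplay between shortest augmenting paths and the dual variables in (6.8) can be illustrated with the classical dense Hungarian method. The sketch below is not the MC64 code: it assumes a dense cost matrix with finite entries, runs in O(n^3) time, grows the matching from unmatched rows rather than unmatched columns, and starts from an empty matching. The name `min_weight_perfect_matching` and the internal arrays are choices made for this example.

```python
import math


def min_weight_perfect_matching(C):
    """Dense shortest-augmenting-path assignment (Hungarian method).

    Returns (q, u, v): row i is matched to column q[i], and u, v are
    dual variables with C[i][j] - u[i] - v[j] >= 0 and equality on
    matched entries, as in (6.8).
    """
    n = len(C)
    u = [0.0] * (n + 1)       # row duals (index 0 unused)
    v = [0.0] * (n + 1)       # column duals (index 0 is a dummy column)
    p = [0] * (n + 1)         # p[j] = row matched to column j (0 = unmatched)
    way = [0] * (n + 1)       # predecessor column on the shortest path
    for i in range(1, n + 1):
        p[0] = i              # start a search from the unmatched row i
        j0 = 0
        minv = [math.inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], math.inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = C[i0 - 1][j - 1] - u[i0] - v[j]   # reduced cost
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Update the duals so that reduced costs stay nonnegative.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:    # reached an unmatched column
                break
        # Augment along the shortest path found.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    q = [0] * n
    for j in range(1, n + 1):
        q[p[j] - 1] = j - 1
    return q, u[1:], v[1:]


C = [[4.0, 1.0, 3.0],
     [2.0, 0.0, 5.0],
     [3.0, 2.0, 2.0]]
q, u, v = min_weight_perfect_matching(C)
total = sum(C[i][q[i]] for i in range(3))
```

For this cost matrix the minimum weight is 5, attained by matching row 0 to column 1, row 1 to column 0 and row 2 to column 2. The returned duals certify optimality through (6.8).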
6.3.4 Dulmage-Mendelsohn Decompositions


The importance of preordering A to block triangular form was discussed in 
Section 3.4. The Dulmage-Mendelsohn decomposition is based on matchings and 
is a generalization of the block triangular form. It provides a precise characterization 
of structurally rank deficient matrices and it can be used to reduce the work required 
for an LU factorization. It comprises row and column permutations P and Q such 
that 


\[
PAQ =
\begin{array}{c|ccc}
      & C_1 & C_2 & C_3 \\ \hline
  R_1 & A_1 & A_4 & A_6 \\
  R_2 & 0   & A_2 & A_5 \\
  R_3 & 0   & 0   & A_3
\end{array}
\tag{6.9}
\]

Here $A_1$ is an $m_1 \times n_1$ underdetermined matrix ($m_1 < n_1$ or $m_1 = n_1 = 0$), $A_2$ is
an $m_2 \times m_2$ square matrix and $A_3$ is an $m_3 \times n_3$ overdetermined matrix ($m_3 > n_3$
or $m_3 = n_3 = 0$). It can be shown that $A_1^T$ and $A_3$ are strong Hall matrices but $A_2$
need not be a strong Hall matrix, in which case it can be permuted to block upper
triangular form.

If row and column sets R and C form a maximum matching of A, then $R_1$ and $R_2$
are subsets of R and $|R_3 \cap R| = n_3$, and $C_2$ and $C_3$ are subsets of C and $|C_1 \cap C| = m_1$.
An example decomposition for a 10 × 10 matrix is given in Figure 6.12. Here
R = {1, 2, ..., 9} and C = {2, 3, ..., 10}.

Figure 6.12 An example of a coarse Dulmage-Mendelsohn decomposition. The blue entries
belong to the maximum matching. $m_1 = 3$, $m_2 = 4$, $m_3 = 3$, $n_1 = 4$, $n_2 = 4$, $n_3 = 2$. Column 1
and row 10 are unmatched.

The coarse Dulmage-Mendelsohn decomposition orders the unmatched
columns as the first columns in PAQ and orders the unmatched rows as the
last rows in PAQ. If A is square and has a perfect matching then its coarse
decomposition has only the matrix $A_2$; otherwise, both $A_1$ and $A_3$ are present.
The coarse decomposition is computed by first finding a maximum matching.
Assuming it is not a perfect matching, the rows in $A_1$ are determined by performing
depth-first searches from the unmatched columns to find all of the row vertices that
are reachable from the unmatched columns via alternating augmenting paths. The
columns in $A_1$ are defined to be the union of the set of unmatched columns and
the set of columns matched with the rows in $A_1$. Similarly, the columns in $A_3$ are
determined by performing depth-first searches from the unmatched rows to find all
of the column vertices that are reachable from the unmatched rows via alternating
augmenting paths. The rows in $A_3$ are defined to be the union of the set of rows
matched to the columns in $A_3$ and the set of unmatched rows.
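
The construction just described can be sketched as follows. This is an illustrative sketch only: the function name `coarse_dm`, the representation of the sparsity pattern as a set of (row, column) pairs, and the small 4 × 4 example are assumptions made here. A maximum matching is found by augmenting paths, and two alternating searches (implemented with an explicit stack) then identify the row and column sets of $A_1$ and $A_3$; whatever remains forms $A_2$.

```python
def coarse_dm(pattern, n_rows, n_cols):
    """Coarse Dulmage-Mendelsohn row/column sets (R1, C1, R2, C2, R3, C3).

    pattern is a set of (i, j) positions of nonzero entries (0-based).
    """
    cols_of_row = [[j for j in range(n_cols) if (i, j) in pattern] for i in range(n_rows)]
    rows_of_col = [[i for i in range(n_rows) if (i, j) in pattern] for j in range(n_cols)]
    match_row = [None] * n_rows    # column matched to each row
    match_col = [None] * n_cols    # row matched to each column

    def augment(j, visited):
        # Depth-first search for an augmenting path from column j.
        for i in rows_of_col[j]:
            if i not in visited:
                visited.add(i)
                if match_row[i] is None or augment(match_row[i], visited):
                    match_row[i], match_col[j] = j, i
                    return True
        return False

    for j in range(n_cols):
        augment(j, set())

    def reach(starts, adjacency, partner):
        # Alternating search: leave a start-side vertex along any edge,
        # then return to the start side along the matched edge.
        found, seen, stack = set(), set(starts), list(starts)
        while stack:
            s = stack.pop()
            for t in adjacency[s]:
                if t not in found:
                    found.add(t)
                    back = partner[t]
                    if back is not None and back not in seen:
                        seen.add(back)
                        stack.append(back)
        return found, seen

    free_cols = [j for j in range(n_cols) if match_col[j] is None]
    free_rows = [i for i in range(n_rows) if match_row[i] is None]
    R1, C1 = reach(free_cols, rows_of_col, match_row)
    C3, R3 = reach(free_rows, cols_of_row, match_col)
    R2 = set(range(n_rows)) - R1 - R3
    C2 = set(range(n_cols)) - C1 - C3
    return tuple(sorted(s) for s in (R1, C1, R2, C2, R3, C3))


# Structurally singular 4 x 4 example: column 1 and row 2 stay unmatched.
pattern = {(0, 0), (0, 1), (0, 3), (1, 2), (2, 2), (3, 3), (3, 2)}
R1, C1, R2, C2, R3, C3 = coarse_dm(pattern, 4, 4)
```

Ordering the rows as $R_1, R_2, R_3$ and the columns as $C_1, C_2, C_3$ gives the block upper triangular form (6.9). In the example, $A_1$ is 1 × 2, $A_2$ is 1 × 1 and $A_3$ is 2 × 1.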

It may be possible to further permute the matrix to obtain the fine Dulmage-Mendelsohn
decomposition. The fine Dulmage-Mendelsohn decomposition computes
P and Q such that $A_1$ and $A_3$ are block diagonal matrices in which each
diagonal block is irreducible, and $A_2$ is block upper triangular with strongly connected
(square) diagonal blocks. Once the coarse decomposition has been computed,
$A_1$ and $A_3$ are searched to find any irreducible blocks and the permutations required
to place these on the diagonals of $A_1$ and $A_3$ are computed. Finally, following
Section 3.4, strongly connected components in $A_2$ are found and a permutation is
formed to reduce $A_2$ to block upper triangular form (with the strongly connected
components lying on the diagonal). If A is reducible and nonsingular, the fine
Dulmage-Mendelsohn decomposition can be used to solve the linear system Ax = b
using block back-substitution.


6.4 Notes and References 


Early theoretical results related to sparse LU factorizations can be found in Rose 
& Tarjan (1978), which extends the systematic understanding of the symbolic 
elimination introduced in Rose et al. (1976). A key paper that influenced the 
discussion and development of both the theory and algorithms for predicting 
sparsity structures in LU factorizations is Gilbert (1994) (first available in 1986 
as a Cornell technical report). As the primary and still very useful resource on 
transitive reduction, we refer to Aho et al. (1972); Gilbert & Liu (1993) extend the 
concept of an elimination tree to study sparse LU factorizations of nonsymmetric 
matrices and present theoretical concepts based on DAGs; see also the parallel 
counterpart in Grigori et al. (2007). Ways to simplify symbolic factorizations and 
prune DAGs are discussed in Eisenstat & Liu (1992, 1993a). An elegant treatment 
of both the theoretical and practical aspects of LU factorizations based on DAGs 
and the nonsymmetric elimination tree (including pruning and pivoting) is given in 
a series of three papers by Eisenstat & Liu (2005a,b, 2007). 

Partial pivoting within the sparse column LU factorization is introduced in 
Gilbert & Peierls (1988). This paper influenced not only further developments in 
sparse LU factorizations but also the development of incomplete factorizations. 
Partial pivoting based on the column elimination tree is first discussed in George 
& Ng (1985); see also Gilbert & Ng (1993) and Li (1996) for further use of column 
elimination trees. Further research on exactness of structural predictions is presented
by Grigori et al. (2009).

The proof of Theorem 6.17 is given by Berge (1957) but the result was observed
earlier (for example, König (1931)). Preordering nonsymmetric matrices using 
matching algorithms is explained in Duff & Koster (1999, 2001). It is based on 
the Hungarian algorithm of Kuhn (1955) and a sparse variant of the shortest path 
algorithm of Dijkstra (1959). Duff and Koster implemented their algorithm in the 
widely used software package MC64. Because MC64 can be expensive to run, there 
has been interest in developing efficient parallel algorithms for finding a perfect 
matching in a weighted bipartite graph (Azad et al., 2020) and also in relaxing the 
optimality requirement to allow the development of cheaper algorithms that can be 
parallelised; see, for example, Hogg & Scott (2015). A classical paper that describes 
the Dulmage-Mendelsohn decomposition is Pothen & Fan (1990). 

The development of supernodal LU factorizations is closely connected with that 
of column LU factorizations. A key paper is by Demmel et al. (1999), in which 
different types of supernodes for nonsymmetric matrices are considered. 

Duff & Reid (1984) describe a symmetric-pattern multifrontal algorithm for nonsymmetric
matrices that generates an assembly tree based on the structure of $A + A^T$.
This employs square frontal matrices and can incur a substantial overhead for highly
nonsymmetric matrices because of unnecessary data dependencies in the assembly
tree and extra explicit zeros in the artificially symmetrized frontal matrices. Davis
& Duff (1997) introduce a nonsymmetric-pattern multifrontal algorithm that seeks
to overcome these deficiencies by using rectangular frontal matrices. This work
later developed into the package UMFPACK of Davis (2004), while Amestoy &
Puglisi (2002) propose a nonsymmetric version of the multifrontal method that
can be regarded as being intermediate between the nonsymmetric-pattern variant
of UMFPACK and the symmetric-pattern multifrontal method. The Watson Sparse
Matrix Package (WSMP, 2020) also uses a nonsymmetric multifrontal algorithm.

Notable early sparse LU solvers were the Yale Sparse Matrix Package (YSMP) 
of Eisenstat et al. (1977) and the Harwell Subroutine Library code MA28 written by 
Duff (1980), followed later by MA48 of Duff & Reid (1996). These codes address 
important practical considerations (for serial computations). Furthermore, the right- 
looking Markowitz packages MA28 and MA48, which are designed particularly for 
highly nonsymmetric matrices, combine the symbolic and numerical factorization 
phases into a single analyse-factorize phase. Contemporary software packages such 
as PARDISO (2022), SuperLU (Li et al., 1999), UMFPACK and WSMP have 
been developed over many years. They provide one of the best ways of under- 
standing the practical value of the ideas presented in research papers and technical 
reports. PARDISO combines left and right-looking updates in a parallel shared- 
memory code that assumes a symmetric nonzero sparsity pattern. SuperLU offers a 
left-looking supernodal variant for sequential machines, SuperLU_MT for shared- 
memory parallel machines, and the right-looking supernodal SuperLU_DIST (Li 
& Demmel, 2003) for highly parallel distributed memory hybrid systems. Demmel 
et al. (1999) and Li (2008) describe the algorithms and performance on various 
machines. The WSMP software is split into a serial and multithreaded single- 
process library for use on a single core or multiple cores on a shared-memory 
machine, and a separate library for distributed memory environments. 
Chapter 7
Stability, Ill-Conditioning, and
Symmetric Indefinite Factorizations


Solving sparse symmetric indefinite systems is more 
problematic. — Ashcraft et al. (1998). 


The factorization of sparse symmetric indefinite systems is 
particularly challenging since pivoting is required to maintain 
stability of the factorization. Pivoting techniques generally offer 
limited parallelism and are associated with significant data 
movement hindering the scalability of these methods — Duff 

et al. (2018). 


Practical computations are invariably based on finite precision arithmetic. Describing
the accuracy of such computations often uses the concept of stability. Consider a
computational algorithm $z = g(d)$ for computing z as a function g of given data d.
The algorithm is said to be backward stable if the computed solution $\hat{z}$ is the exact
solution of $\hat{z} = g(d + \Delta d)$, where the perturbation $\Delta d$ is “small” for all possible
inputs d. What is meant by small depends on the context. For example, if d is based
on physical measurements that are necessarily inaccurate, $\Delta d$ is small if its absolute
value is no larger than the inaccuracies in determining d. The minimum
absolute value $|\Delta d|$ among such perturbations is called the (absolute) backward
error (or, if divided by $|d|$, the relative backward error). To distinguish them from
backward errors, the absolute and relative errors of $\hat{z}$ are called forward errors.
Backward stability is a property of the computational algorithm and to compute
solutions with a small backward error we need to consider stable algorithms.

A related concept that influences the quality of the computed solution is ill-conditioning.
The problem $z = g(d)$ is said to be ill-conditioned if small
perturbations in the data d can lead to large changes in the computed z. The
condition number measures how sensitive the output of a function is to its input.
Ill-conditioning, which is measured in terms of the condition number, is a property of
the problem. Provided the backward error, forward error, and the condition number
are defined in a consistent manner, the following approximate inequality holds:

\[ \text{forward error} \lesssim \text{condition number} \times \text{backward error}. \]
© The Author(s) 2023
J. Scott, M. Tůma, Algorithms for Sparse Linear Systems, Nečas Center Series,
https://doi.org/10.1007/978-3-031-25820-6_7


This says that the computed solution to an ill-conditioned problem can have a large 
forward error because even if the computed solution has a small backward error, this 
error can be amplified by a large condition number. By preprocessing the problem 
it may be possible to improve its conditioning. In this chapter, we discuss both 
the stability of numerical factorizations and preprocessing of the linear system to 
improve conditioning. 
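
A small worked example may help to make the relation between forward error, backward error and condition number concrete. The 2 × 2 system below is invented for illustration, and the data perturbation is applied to the right-hand side only, which is a simplifying assumption. Exact rational arithmetic (Python's `fractions` module) is used so that the effect shown is due to ill-conditioning rather than rounding.

```python
from fractions import Fraction as F


def solve2(A, b):
    """Exact solution of a 2 x 2 system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]


def norm_inf_vec(x):
    return max(abs(t) for t in x)


def norm_inf_mat(M):
    return max(sum(abs(t) for t in row) for row in M)


# Nearly singular matrix: the two rows are almost identical.
A = [[F(1), F(1)], [F(1), F(10001, 10000)]]
b = [F(2), F(20001, 10000)]          # exact solution is x = (1, 1)
db = [F(0), F(1, 10000)]             # small perturbation of the data b

x = solve2(A, b)
x_pert = solve2(A, [b[0] + db[0], b[1] + db[1]])

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
cond = norm_inf_mat(A) * norm_inf_mat(A_inv)      # infinity-norm condition number

backward = norm_inf_vec(db) / norm_inf_vec(b)     # relative change in the data
forward = norm_inf_vec([x_pert[0] - x[0], x_pert[1] - x[1]]) / norm_inf_vec(x)
```

A relative perturbation of about 5e-5 in the data moves the solution from (1, 1) to (0, 2), a relative change of 1. This is consistent with the bound because the condition number is roughly 4e4.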


7.1 Backward Stability 


We start with a simple backward error result. Here $\epsilon$ denotes the machine precision.


Theorem 7.1 (Demmel 1997; Watkins 2002) Let the computed LU factorization
of a matrix A be $A + \Delta A = \hat{L}\hat{U}$. The perturbation $\Delta A$ that results from using finite
precision arithmetic satisfies

\[ \|\Delta A\|_\infty \le n\, O(\epsilon)\, \|\hat{L}\|_\infty \|\hat{U}\|_\infty + O(\epsilon^2). \tag{7.1} \]

Moreover, the computed solution $\hat{x}$ of the linear system Ax = b satisfies
$(A + \Delta' A)\hat{x} = b$ with

\[ \|\Delta' A\|_\infty \le n\, O(\epsilon)\, \|\hat{L}\|_\infty \|\hat{U}\|_\infty + O(\epsilon^2). \tag{7.2} \]

At stage k of Gaussian elimination, the computed diagonal entry $a_{kk}^{(k)}$ is termed
the pivot ($1 \le k \le n$). Gaussian elimination breaks down if a zero pivot is
encountered. Provided A is nonsingular, row interchanges can be incorporated to
prevent this happening (Theorem 1.1). The systematic use of row permutations is
called partial pivoting and was introduced in Section 3.1.2. If $|a_{kk}^{(k)}|$ is very small
(compared to other entries in the active submatrix), then it can cause difficulties in
finite precision arithmetic because the absolute value of the corresponding computed
multiplier $l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}$ can then be very large. Partial pivoting can be used to
ensure $|l_{ik}| \le 1$, that is, the rows of A that have not yet been pivoted on can be
permuted so that the new pivot satisfies

\[ \max_{i > k} |a_{ik}^{(k)}| \le |a_{kk}^{(k)}|. \]


If $P_k$ is the row permutation at stage k and $P = P_{n-1} P_{n-2} \cdots P_1$, then the computed
factors of PA satisfy

\[ \|\hat{L}\|_\infty \le n \quad \text{and} \quad \|\hat{U}\|_\infty \le n\, \rho_{growth} \|A\|_\infty, \]

where the growth factor $\rho_{growth}$ is defined to be

\[ \rho_{growth} = \max_{i,j,k} \left( |a_{ij}^{(k)}| \,/\, \max_{i,j} |a_{ij}| \right). \tag{7.3} \]

The bounds (7.1) and (7.2) can be rewritten as

\[ \|\Delta A\|_\infty \le n^3 \rho_{growth}\, O(\epsilon) \|A\|_\infty, \qquad \|\Delta' A\|_\infty \le n^3 \rho_{growth}\, O(\epsilon) \|A\|_\infty. \]


In practice, these bounds are pessimistic and the actual errors are typically much 
smaller. Because backward stability of an LU factorization is influenced both by 
the initial ordering of A and the pivoting strategy, it is said to be conditionally 
backward stable. 

For a symmetric positive definite (SPD) matrix A, pivoting for stability is not 
needed. The following states that the Cholesky factorization of A is unconditionally 
backward stable, allowing the stable computation of the solution of the correspond- 
ing linear system. 


Theorem 7.2 (Demmel 1997; Watkins 2002) Let the computed Cholesky factorization
of an SPD matrix A be $A + \Delta A = \hat{L}\hat{L}^T$. The perturbation $\Delta A$ that results
from using finite precision arithmetic satisfies

\[ \|\Delta A\|_\infty \le n^2\, O(\epsilon)\, \|A\|_\infty. \]

Moreover, the computed solution $\hat{x}$ of the linear system Ax = b satisfies
$(A + \Delta' A)\hat{x} = b$ with

\[ \|\Delta' A\|_\infty \le n^2\, O(\epsilon)\, \|A\|_\infty. \]

Both the unconditional backward stability of a Cholesky factorization of an
SPD matrix and the conditional backward stability of an LU factorization of a
general A make algorithms for solving linear systems that are based on factorizing
A preferable to computing and applying $A^{-1}$. The computed inverse is typically
not the exact inverse of a nearby matrix $A + \Delta A$ for any small perturbation $\Delta A$.
Furthermore, the following pessimistic result shows it is impractical to compute and
store $A^{-1}$, regardless of how sparse A is.


Theorem 7.3 (Duff et al. 1988) If A is irreducible, then the sparsity pattern
$S\{A^{-1}\}$ of its inverse is fully dense.


Proof Without loss of generality, assume A is factorizable. For if not, there is a
permutation matrix P such that the row permuted matrix
PA is factorizable (Theorem 1.1). In this case, consider PA instead of A because
for any permutation matrix P the inverse $(PA)^{-1}$ is fully dense if and only if $A^{-1}$ is
fully dense. Let K be the matrix of order 2n given by

\[ K = \begin{pmatrix} A & I_n \\ I_n & 0 \end{pmatrix}. \]

After applying n elimination steps to $K = K^{(1)}$, the order n active submatrix of
$K^{(n+1)}$ is $-A^{-1}$. Consider entry $(A^{-1})_{ij}$ ($1 \le i, j \le n$). Because A is irreducible
and the off-diagonal (1, 2) and (2, 1) blocks of K are equal to the identity matrix,
there is a directed path $i \Rightarrow j$ in G(K) such that the indices of all the intermediate
vertices on the path are less than or equal to n. Theorem 3.1 and the non-cancellation
assumption imply $(A^{-1})_{ij} \neq 0$. It follows that $A^{-1}$ is fully dense. □


The above proof implies that the nonzero entries of $A^{-1}$ correspond to paths in G(A)
even when A is not irreducible. This result is given in the following corollary.


Corollary 7.4 (Rose & Tarjan 1978; Duff et al. 1988) If A is factorizable, then
$(A^{-1})_{ij} \neq 0$ ($1 \le i, j \le n$) if and only if there exists a path $i \Rightarrow j$ in G(A).
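
Theorem 7.3 and Corollary 7.4 can be checked on small examples. The sketch below uses two invented 3 × 3 matrices with values chosen so that no numerical cancellation occurs, and it treats every vertex as reachable from itself, which matches the nonzero diagonal of the inverse. The helper names `inverse`, `reachable` and `pattern` are assumptions for this example; the inverse is computed exactly in rational arithmetic without pivoting, which relies on A being factorizable.

```python
from fractions import Fraction


def inverse(A):
    """Exact Gauss-Jordan inverse without pivoting (assumes A is factorizable)."""
    n = len(A)
    M = [[Fraction(A[i][j]) for j in range(n)] +
         [Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        piv = M[k][k]
        M[k] = [x / piv for x in M[k]]
        for i in range(n):
            if i != k and M[i][k] != 0:
                f = M[i][k]
                M[i] = [a - f * b for a, b in zip(M[i], M[k])]
    return [row[n:] for row in M]


def reachable(A):
    """reach[i][j] is True if i == j or G(A) has a directed path from i to j."""
    n = len(A)
    reach = [[i == j or A[i][j] != 0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach


def pattern(M):
    return [[x != 0 for x in row] for row in M]


# Reducible (upper triangular) matrix: edges 0 -> 1 and 1 -> 2 only.
A_reducible = [[2, 1, 0], [0, 3, 1], [0, 0, 4]]
# Adding the entry (2, 0) closes a cycle, making the matrix irreducible.
A_irreducible = [[2, 1, 0], [0, 3, 1], [1, 0, 4]]

inv_pattern_reducible = pattern(inverse(A_reducible))
inv_pattern_irreducible = pattern(inverse(A_irreducible))
```

The inverse of the triangular matrix is upper triangular with a full upper triangle, exactly the set of pairs connected by a path. The inverse of the irreducible matrix has no zero entries.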


7.2 Pivoting Strategies for Dense Matrices 


This section briefly describes the pivoting strategies that are used in LU factorizations
of general dense matrices and, in the symmetric indefinite case, in $LDL^T$
factorizations. Here and in the following sections, all the quantities (such as $a_{ij}^{(k)}$)
are the computed quantities.


7.2.1 Partial Pivoting 


Partial pivoting interchanges rows at each stage of the factorization to select the
entry of largest absolute value in its column as the next pivot (Section 3.1.2). If
partial pivoting is used, it is straightforward to show that the growth factor (7.3)
satisfies

\[ \rho_{growth} \le 2^{n-1}. \]

Although the bound can be achieved in nontrivial cases, it is generally extremely 
pessimistic, particularly when n is very large. In practice, Gaussian elimination with 
partial pivoting is often regarded as being a stable algorithm and is the pivoting 
strategy of choice for dense matrices. 
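
The bound $2^{n-1}$ is attained by a classical example usually attributed to Wilkinson: ones on the diagonal and in the last column, and minus ones below the diagonal. The sketch below runs dense Gaussian elimination with partial pivoting and records the largest entry seen, relative to the largest entry of A. The function names are chosen for this illustration, and the code assumes A is nonsingular.

```python
def wilkinson(n):
    """Matrix with 1 on the diagonal and in the last column, -1 below the diagonal."""
    return [[1.0 if (i == j or j == n - 1) else (-1.0 if i > j else 0.0)
             for j in range(n)] for i in range(n)]


def growth_factor_partial_pivoting(A):
    """Gaussian elimination with partial pivoting; returns the growth factor."""
    n = len(A)
    U = [row[:] for row in A]
    max_initial = max(abs(x) for row in U for x in row)
    max_seen = max_initial
    for k in range(n - 1):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        r = max(range(k, n), key=lambda i: abs(U[i][k]))
        U[k], U[r] = U[r], U[k]
        for i in range(k + 1, n):
            l_ik = U[i][k] / U[k][k]          # multiplier, |l_ik| <= 1
            for j in range(k, n):
                U[i][j] -= l_ik * U[k][j]
        # Largest entry of the rows that are still being updated.
        max_seen = max(max_seen, max(abs(x) for row in U[k + 1:] for x in row))
    return max_seen / max_initial


rho_5 = growth_factor_partial_pivoting(wilkinson(5))
rho_10 = growth_factor_partial_pivoting(wilkinson(10))
```

No row interchanges are triggered for this matrix, and the last column doubles at every stage, giving growth factors 16 and 512 for n = 5 and n = 10.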


7.2.2 Complete Pivoting 

A much smaller bound can be obtained if complete (or full) pivoting is used. It
chooses the pivot to be the largest entry (in absolute value) in the active submatrix,
that is, at stage k the pivot $a_{kk}^{(k)}$ is chosen so that

\[ \max_{i \ge k,\, j \ge k} |a_{ij}^{(k)}| \le |a_{kk}^{(k)}|. \]

In this case,

\[ \rho_{growth} \le n^{1/2} \left( 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)} \right)^{1/2}. \tag{7.4} \]


The disadvantages of complete pivoting are that it is expensive (the whole active 
submatrix must be searched for a pivot), and because the test is tougher than for 
partial pivoting, it is more likely that permutations (and hence more data movement) 
will be required. 


7.2.3 Rook Pivoting 


A pivoting strategy that is more restrictive than partial pivoting but cheaper than
complete pivoting is rook pivoting. Here the pivot is chosen to be the largest entry
in its row and its column, that is,

\[ \max_{i > k} \left( |a_{ik}^{(k)}|, |a_{ki}^{(k)}| \right) \le |a_{kk}^{(k)}|. \]

The strategy takes its name from the fact that the search for a pivot corresponds to
the moves of a rook in the game of chess. Clearly, the search for a pivot in rook
pivoting involves at least twice as many comparisons as for partial pivoting and if
the whole active submatrix has to be searched, then the number of comparisons is
the same as for complete pivoting. However, in practice, the cost is usually a small
multiple of the cost of partial pivoting and significantly less than that of complete
pivoting. The growth factor for rook pivoting satisfies

\[ \rho_{growth} \le 1.5\, n^{(3/4) \log n}. \]
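
The rook-like search can be sketched as follows. This is a minimal dense illustration with a hypothetical function name `rook_pivot`; it alternates between scanning a column and scanning a row of the active submatrix until it reaches an entry that is largest in absolute value in both. Each move strictly increases the magnitude of the candidate, so the search terminates.

```python
def rook_pivot(A, k):
    """Return (i, j), i, j >= k, such that |A[i][j]| is the largest entry
    in both its row and its column of the active submatrix A[k:][k:]."""
    n = len(A)
    j = k
    i = max(range(k, n), key=lambda r: abs(A[r][j]))      # largest in column k
    while True:
        # Move along row i to its largest entry, as a rook would.
        j_new = max(range(k, n), key=lambda c: abs(A[i][c]))
        if abs(A[i][j_new]) <= abs(A[i][j]):
            return i, j           # A[i][j] dominates its row and its column
        j = j_new
        # Move along column j to its largest entry.
        i_new = max(range(k, n), key=lambda r: abs(A[r][j]))
        if abs(A[i_new][j]) <= abs(A[i][j]):
            return i, j
        i = i_new


A = [[1.0, 2.0, 0.0],
     [3.0, 4.0, 9.0],
     [2.0, 10.0, 5.0]]
i, j = rook_pivot(A, 0)
```

Here the search stops at the entry 9 in position (1, 2) after scanning one column, one row and one further column. Complete pivoting would instead search the whole matrix and select the entry 10.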


7.2.4 2 x 2 Pivoting 


When the matrix A is symmetric but indefinite, it may not be possible to select 
pivots from the diagonal (for example, if all the diagonal entries of A are zero). If 
rows of A are permuted (so that off-diagonal entries are selected as pivots), then 
symmetry is destroyed, which means an LU factorization must be performed and 
this essentially doubles the cost of the factorization in terms of both storage and 
operation counts. Symmetry can be preserved by extending the notion of a pivot to 
2 x 2 blocks. 
Consider the symmetric indefinite A given by

\[ A = \begin{pmatrix} \delta & 1 \\ 1 & 0 \end{pmatrix}. \]

If $\delta = 0$, an $LDL^T$ factorization in which D is a diagonal matrix does not exist.
Furthermore, if $\delta \ll 1$, then an $LDL^T$ factorization with D diagonal is not stable
because $\rho_{growth} = 1/\delta$. However, if the $LDL^T$ factorization is generalized to allow
D to be a block diagonal matrix with 1 × 1 and 2 × 2 blocks, then a factorization
is obtained that preserves symmetry and is nearly as stable as an LU factorization.
This is illustrated by the factorization of the following 3 × 3 symmetric indefinite
matrix

\[
A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix}
  = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
    \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
  = LDL^T.
\]

Here D has one 1 × 1 block and one 2 × 2 block.
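
The 3 × 3 example can be verified directly. The short check below multiplies out the three factors and also forms the Schur complement left after eliminating the first pivot, which shows why a 2 × 2 pivot is needed at the second stage. The helper functions are written for this illustration only.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


A = [[1, 1, 0], [1, 1, 1], [0, 1, 0]]
L = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
D = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]    # one 1 x 1 block and one 2 x 2 block

product = matmul(matmul(L, D), transpose(L))

# Schur complement after using a_11 = 1 as a 1 x 1 pivot.
schur = [[A[i][j] - A[i][0] * A[0][j] / A[0][0] for j in (1, 2)] for i in (1, 2)]
```

The Schur complement has zero diagonal entries, so no further 1 × 1 diagonal pivot is available; it is used as a single 2 × 2 pivot and appears as the trailing block of D.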

Rook pivoting can be extended to include 2 × 2 pivots. An iterative procedure
searches for an entry that is simultaneously the largest in absolute value in row i and
column j of the active submatrix $A^{(k)}$. This entry is used to build a symmetric 2 × 2
pivot; the search terminates prematurely if a suitable 1 × 1 pivot is found, that is, a
pivot that satisfies a threshold test. The standard choice for the threshold comes from
requiring the same potential maximal growth in the absolute values of the entries of
the partially eliminated matrix that results from either two consecutive 1 × 1 pivots
or one 2 × 2 pivot. It can be shown that the appropriate choice is $(1 + \sqrt{17})/8$. In
this case, the growth factor satisfies

\[ \rho_{growth} \le 3n \sqrt{2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)}}, \]

which is only slightly worse than the bound (7.4) for an LU factorization with
complete pivoting. Note that the number of partially eliminated matrices depends
on the number of 2 × 2 pivots. If a 2 × 2 pivot is selected at stage k, then the next
partially eliminated matrix is $A^{(k+2)}$.


7.3 Pivoting Strategies for Sparse Matrices 


7.3.1 Threshold Partial Pivoting 


While the growth factor is important, for sparse matrices the pivoting strategies 
discussed so far lack the scope to preserve sparsity. In the sparse case, it is necessary 
to balance pivoting for stability with limiting the amount of fill-in in the factors. The 
compromise strategy that seeks to achieve this is called threshold partial pivoting, 
which is a generalization of partial pivoting. At stage k of the numerical factorization 
phase of a sparse LU solver, the pivot is selected so that after permuting it to the first
entry of the active submatrix $A^{(k)}$ it satisfies

\[ \max_{i > k} |a_{ik}^{(k)}| \le \gamma^{-1} |a_{kk}^{(k)}|, \tag{7.5} \]

where $\gamma \in (0, 1]$ is a chosen threshold parameter. It is straightforward to see that

\[ \max_i |a_{ij}^{(k)}| \le (1 + \gamma^{-1}) \max_i |a_{ij}^{(k-1)}| \le (1 + \gamma^{-1})^{nz_j} \max_i |a_{ij}|, \]

where $nz_j$ is the number of off-diagonal entries in the j-th column of the U factor.
Furthermore,

\[ \rho_{growth} \le (1 + \gamma^{-1})^{nz_{max}}, \]

where $nz_{max} = \max_j nz_j \le n - 1$. Choosing $\gamma = 1$ reduces to partial pivoting;
using a smaller value potentially leads to greater growth in the size of the entries
in the factors but allows pivots to be chosen that are better able to preserve
sparsity. The default choice for $\gamma$ is typically between 0.1 and 0.01 but in some
practical applications much smaller values are sometimes employed to speed up the
factorization (at the possible cost of less accurate factors).

A threshold can also be incorporated into rook pivoting. The pivot must then
be at least $\gamma$ times the absolute value of any other entry in its row and column
of the active submatrix. Threshold rook pivoting has the potential to limit growth
more successfully than threshold partial pivoting. In the symmetric case, if pivots
are selected from the diagonal (to preserve symmetry), threshold partial pivoting is
the same as threshold rook pivoting.
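
The trade-off controlled by $\gamma$ can be illustrated with a small pivot selection routine. This sketch makes two assumptions that are not taken from the text: candidates are drawn from a single column, and among the candidates that pass the threshold test (7.5) the one in the row with the fewest nonzeros is preferred as a simple stand-in for a sparsity measure. The function name `select_threshold_pivot` and the data are invented for the example.

```python
def select_threshold_pivot(column, row_counts, gamma):
    """Return the row index of an acceptable pivot in the given column.

    column[i]     -- value of the candidate entry in row i of the pivot column
    row_counts[i] -- number of nonzeros in row i (hypothetical sparsity measure)
    gamma         -- threshold parameter in (0, 1]
    """
    col_max = max(abs(x) for x in column)
    # Threshold test (7.5): |pivot| >= gamma * (largest entry in the column).
    acceptable = [i for i, x in enumerate(column)
                  if x != 0 and abs(x) >= gamma * col_max]
    # Among numerically acceptable candidates, prefer the sparsest row;
    # ties are broken in favour of the larger entry.
    return min(acceptable, key=lambda i: (row_counts[i], -abs(column[i])))


column = [0.5, -8.0, 0.05, 2.0]
row_counts = [2, 7, 1, 3]

pivot_partial = select_threshold_pivot(column, row_counts, 1.0)    # partial pivoting
pivot_default = select_threshold_pivot(column, row_counts, 0.1)
pivot_relaxed = select_threshold_pivot(column, row_counts, 0.01)
```

With $\gamma = 1$ only the largest entry (row 1, which is dense) is acceptable. Relaxing the threshold admits smaller entries in sparser rows, at the price of a larger potential growth factor.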


7.3.2 Threshold 2 x 2 Pivoting 


If A is a symmetric matrix, then standard fill-reducing ordering algorithms (which
will be discussed in the next chapter) and the symbolic factorization phase employ
only the sparsity pattern of A. In general, if A is indefinite, during the numerical
factorization it is necessary to modify the chosen elimination order to maintain
stability. As already observed, if symmetry is to be preserved, 1 × 1 and 2 × 2 pivots
are needed, resulting in an $LDL^T$ factorization in which D is a block diagonal matrix
with 1 × 1 and 2 × 2 blocks. Limiting the size of the entries of L so that

\[ |l_{ij}| \le \gamma^{-1} \tag{7.6} \]

for all i, j, together with a backward stable scheme for solving 2 × 2 linear systems,
suffices to show backward stability for the entire solution process.

In the sparse symmetric indefinite case, the stability test for a 1 × 1 pivot in
column t of the active submatrix at stage k is the standard threshold test

\[ \max_{i \neq t;\, i \ge k} |a_{it}^{(k)}| \le \gamma^{-1} |a_{tt}^{(k)}|. \tag{7.7} \]

For a 2 × 2 pivot in rows and columns s and t the corresponding test is

\[
\left| \begin{pmatrix} a_{ss}^{(k)} & a_{st}^{(k)} \\ a_{st}^{(k)} & a_{tt}^{(k)} \end{pmatrix}^{-1} \right|
\begin{pmatrix} \max_{i \neq s,t;\, i \ge k} |a_{is}^{(k)}| \\ \max_{i \neq s,t;\, i \ge k} |a_{it}^{(k)}| \end{pmatrix}
\le \gamma^{-1} \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\tag{7.8}
\]

where the absolute value of the matrix is interpreted element-wise. If $a_{tt}^{(k)}$ is accepted
as a 1 × 1 pivot, it becomes the next diagonal entry of D and row and column t are
permuted (if necessary) to the pivotal position k. The corresponding diagonal entry
of L is 1 and from the inequality (7.7), the off-diagonal entries of column k of L are
bounded in absolute value by $\gamma^{-1}$. If
$\begin{pmatrix} a_{ss}^{(k)} & a_{st}^{(k)} \\ a_{st}^{(k)} & a_{tt}^{(k)} \end{pmatrix}$
is accepted as a 2 × 2 pivot, it
becomes the next diagonal block of D and rows and columns s and t are permuted
(if necessary) to the next two pivotal positions, k and k + 1. The corresponding
diagonal block of L is the identity matrix of order 2 and inequality (7.8) ensures
that the off-diagonal entries of these columns of L are bounded in absolute value by
$\gamma^{-1}$.

In addition to bounding the size of the entries in L, the ability to stably apply the
inverse of D to a vector is required. This is trivially the case for 1 × 1 pivots, but
for 2 × 2 pivots it is necessary to check that the determinant
$|a_{ss}^{(k)} a_{tt}^{(k)} - a_{st}^{(k)} a_{ts}^{(k)}|$
is sufficiently large and cancellation does not occur during the application of the
inverse.
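
The two threshold tests can be written down compactly. The sketch below is a dense illustration with hypothetical function names `accept_1x1` and `accept_2x2`; the caller is assumed to pass the off-diagonal entries of the candidate columns (excluding the pivot rows). Only a nonzero determinant is checked here, whereas a practical code would also require it to be sufficiently large, as noted above.

```python
def accept_1x1(a_tt, col_t_offdiag, gamma):
    """Threshold test for a 1 x 1 pivot: max |a_it| <= |a_tt| / gamma."""
    return a_tt != 0 and max(map(abs, col_t_offdiag), default=0.0) <= abs(a_tt) / gamma


def accept_2x2(a_ss, a_st, a_tt, col_s_offdiag, col_t_offdiag, gamma):
    """Threshold test for the symmetric 2 x 2 pivot [[a_ss, a_st], [a_st, a_tt]]."""
    det = a_ss * a_tt - a_st * a_st
    if det == 0:
        return False
    # Element-wise absolute value of the inverse of the 2 x 2 block.
    inv_abs = [[abs(a_tt / det), abs(a_st / det)],
               [abs(a_st / det), abs(a_ss / det)]]
    m_s = max(map(abs, col_s_offdiag), default=0.0)
    m_t = max(map(abs, col_t_offdiag), default=0.0)
    bound = 1.0 / gamma
    return (inv_abs[0][0] * m_s + inv_abs[0][1] * m_t <= bound and
            inv_abs[1][0] * m_s + inv_abs[1][1] * m_t <= bound)


gamma = 0.1
# Tiny diagonal entry with large off-diagonal entries: rejected as a 1 x 1 pivot.
tiny_ok = accept_1x1(1e-8, [1.0, 0.5], gamma)
# Pairing it with the next row and column gives an acceptable 2 x 2 pivot.
block_ok = accept_2x2(1e-8, 1.0, 0.0, [0.5], [0.25], gamma)
# A well-scaled diagonal entry passes the 1 x 1 test.
large_ok = accept_1x1(4.0, [1.0, 0.5], gamma)
```

The example mirrors the 2 × 2 matrix with a small $\delta$ on the diagonal: the 1 × 1 candidate fails because the multipliers would be of size $1/\delta$, while the 2 × 2 block has a well-conditioned inverse.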

A major difficulty when stability tests are incorporated into sparse factorizations 
is that a pivot satisfying the stability criteria may not exist. We discuss this for 
symmetric indefinite A but the same problem occurs for general A. Consider the 
supernodal approach of Section 5.3 and the nodal matrix shown in Figure 7.1. 
Pivots can only be chosen from the block Ldiag on the diagonal (the block is square 
and symmetric and only its lower triangular part is held) but the entries in the off- 
diagonal block L,ect are involved in the stability tests: large entries in L,ect can 
cause pivot candidates to fail the threshold tests (7.7), (7.8). If Laiag is of order p 
and only q < p pivots can be found that satisfy the tests, then p — q pivots must 
be delayed. That is, the variables that have not been pivoted on are passed up the 
assembly tree to the parent and the columns of the block column corresponding to 
these variables are appended to those of the nodal matrix at the parent. The delayed 
columns are retested at the parent and, if the stability test is still not satisfied, they 
are passed further up the assembly tree (at the root a full set of p pivots can be
chosen provided the matrix is nonsingular and γ ≤ 0.5).

Observe that to be able to test for large entries, all the off-diagonal entries in a 
block column must be fully updated before the block on the diagonal is factorized. 
This means that the factorize_block task and all the solve_block tasks for a block 
column that are used in the SPD case (Section 5.3) are combined into a single 
factorize_column task. Thus there are fewer but larger tasks and this reduces the 
scope for parallelism. 




Figure 7.1 An illustration of a simple nodal matrix: the square block $L_{diag}$ on the
diagonal sits above the rectangular off-diagonal block $L_{rect}$. Pivot candidates are
restricted to the square block $L_{diag}$ on the diagonal.


The problem of delayed pivots arises also in the multifrontal method. At each
stage of the computation there is a dense symmetric indefinite frontal matrix F of
order $n_F$ of the form

$$F = \begin{pmatrix} F_{11} & F_{21}^T \\ F_{21} & F_{22} \end{pmatrix}, \qquad (7.9)$$

where $F_{11}$ is a p x p matrix corresponding to the fully summed variables. Pivots
can only be selected from $F_{11}$ but the numerical values of the entries in $F_{21}$ must
be taken into account when testing for stability. If q ≤ p pivots are found, then


the partial factorization of F is $P_F F P_F^T = L_F D_F L_F^T$, where

$$P_F = \begin{pmatrix} P_{11} & \\ & I \end{pmatrix}$$

is a permutation matrix with $P_{11}$ of order p,

$$L_F = \begin{pmatrix} L_{11} & \\ L_{21} & I \end{pmatrix}$$

with $L_{11}$ a unit lower triangular matrix of order q, and

$$D_F = \begin{pmatrix} D_1 & \\ & S \end{pmatrix},$$

with $D_1$ a block diagonal matrix of
order q and S a dense matrix of order $n_F - q$. A basic procedure for selecting pivots
and partially factorizing F is summarized in Algorithms 7.1 and 7.2. Here updating
means applying the elimination operations. Observe that candidate pivots are only
permuted to the start of the frontal matrix once they have been accepted (passed
the stability test). Algorithm 7.2 can be modified for a supernodal factorization,
replacing the frontal matrix by a supernodal matrix.

So far, we have assumed that A is nonsingular, but consistent systems of linear
equations with a (nearly) singular matrix can occur in practice and only minor
modifications are needed to handle this. When a column is searched, if its largest
entry is found to have absolute value less than a chosen threshold δ, the column
(and, by symmetry, the row) is set to zero, the diagonal entry is accepted as a zero
1 x 1 pivot, and no update operations are applied to the remaining columns of
F. This is equivalent to perturbing the entries of A in the pivotal column by at most

ALGORITHM 7.1 Simple partial sparse indefinite factorization

Input: Symmetric indefinite matrix F of order n_F of the form (7.9) with F_11 of
order p; threshold γ ∈ (0, 0.5].
Output: Updated F; partial factors L_F and D_F and permutation P_F.

1: q = 0, t = 0      ▷ q holds the sum of the sizes (1 or 2) of the pivots chosen so far
2: while q < p do
3:    find_pivot(piv_size)                        ▷ See Algorithm 7.2
4:    if (piv_size = 0) exit while loop           ▷ Failed to find a pivot
5:    q = q + piv_size
6:    Update columns q + 1 to p of F              ▷ Right-looking
7: end while
8: Apply updates to columns p + 1 to n_F of F     ▷ Left-looking


ALGORITHM 7.2 Find a pivot in F using threshold partial pivoting

Input: F, L_F, D_F, P_F, p, q, t, γ are accessed from the environment of the call.
Output: Selected pivot of size piv_size; computed columns q + 1 : q + piv_size
of L_F and D_F, updated P_F and t.

1: subroutine find_pivot(piv_size)
2:    piv_size = 0
3:    for test = 1 : p − q do
4:       t = t + 1; if (t > p) set t = q + 1      ▷ Column t is searched for a pivot
5:       if (there is s such that q + 1 ≤ s ≤ t − 1 and the block
            (f_ss f_st; f_st f_tt) passes 2 x 2 pivot test) then
6:          piv_size = 2
7:          Symmetrically permute rows/columns q + 1 and s of F     ▷ Update P_F
8:          Symmetrically permute rows/columns q + 2 and t of F     ▷ Update P_F
9:          Compute columns q + 1 and q + 2 of D_F and L_F
10:         return
11:      else if (f_tt passes 1 x 1 pivot test) then
12:         piv_size = 1
13:         Symmetrically permute rows/columns q + 1 and t of F     ▷ Update P_F
14:         Compute column q + 1 of D_F and L_F
15:         return
16:      end if
17:   end for
18: end subroutine find_pivot
δ and the computed factorization is of a nearby singular matrix. It is convenient for
the subsequent solve phase to store $D_F^{-1}$ in place of $D_F$, with entries on the diagonal
corresponding to zero pivots set to zero.
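
The way pivots are accepted or delayed can be illustrated with a much reduced Python sketch. It is not Algorithm 7.1 or 7.2 as stated: it works on a small dense matrix, uses 1 x 1 pivots only, simply rescans the remaining candidates after each accepted pivot, and omits the zero-pivot handling. The function name `partial_ldlt_1x1` and its return values are hypothetical. The pivot test scans the whole column, so entries of $F_{21}$ can cause a fully summed variable to be delayed.

```python
def partial_ldlt_1x1(F, p, gamma=0.5):
    """Partial LDL^T of symmetric F with 1 x 1 pivots taken from F[:p][:p]."""
    n = len(F)
    W = [row[:] for row in F]                  # working copy, updated in place
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    perm = list(range(n))                      # perm[i]: original index at position i
    d = []                                     # accepted pivots
    q, found = 0, True
    while q < p and found:
        found = False
        for t in range(q, p):                  # remaining fully summed candidates
            off = max((abs(W[i][t]) for i in range(q, n) if i != t), default=0.0)
            if abs(W[t][t]) > 0.0 and abs(W[t][t]) >= gamma * off:
                # Symmetrically permute rows/columns q and t.
                W[q], W[t] = W[t], W[q]
                for row in W:
                    row[q], row[t] = row[t], row[q]
                L[q][:q], L[t][:q] = L[t][:q], L[q][:q]
                perm[q], perm[t] = perm[t], perm[q]
                piv = W[q][q]
                d.append(piv)
                for i in range(q + 1, n):
                    L[i][q] = W[i][q] / piv
                for i in range(q + 1, n):      # update the remaining matrix
                    for j in range(q + 1, n):
                        W[i][j] -= L[i][q] * piv * L[j][q]
                q += 1
                found = True
                break
    S = [row[q:] for row in W[q:]]             # block S of order n - q
    return q, perm, L, d, S


# Column 0 has a large entry in F21, so its pivot is delayed (q = 1 < p = 2).
F = [[1.0, 0.0, 10.0],
     [0.0, 2.0, 0.5],
     [10.0, 0.5, 1.0]]
q, perm, L, d, S = partial_ldlt_1x1(F, p=2)
print(q, perm, d, S)
```

The p − q uneliminated fully summed rows and columns remain in S and would be passed to the parent.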


7.3.3 Relaxed and Static Pivoting 


If pivots are delayed during the numerical factorization, then the data structures 
that were set up during the symbolic phase must be modified. This significantly 
complicates the development of general and symmetric indefinite sparse direct 
solvers compared to sparse Cholesky solvers. Furthermore, it increases the operation 
count and memory required to perform the factorization and, more importantly, it 
can severely limit the scope for parallelism. Maintaining stability and using static 
data structures are conflicting objectives. 

If no candidate pivot satisfies the threshold test but the pivot that is nearest to
satisfying it would satisfy it with a threshold $\gamma_1 < \gamma$, then provided $\gamma_1$ is at least
some chosen minimum value, relaxed pivoting accepts this pivot and reduces γ
to $\gamma_1$. The new value $\gamma_1$ is employed thereafter. This means that the factorization
is potentially less stable but, with fewer delayed pivots, the factors may be sparser
than if the original γ was used throughout.

With relaxed pivoting, delayed pivots can still occur and it may not be possible 
to use static data structures. Static pivoting allows static data structures because it 
permits no delayed pivots. When a candidate pivot is found to be too small (and no 
other eligible candidate passes the stability test), static pivoting replaces it by a user 
defined value. A small value may make the factorization more accurate but can lead 
to large growth in the size of the entries in the factors, while a large value controls 
this growth but reduces the accuracy of the factorization. As well as allowing the use 
of a static task graph and the structures predicted by the symbolic factorization, other 
benefits of static pivoting are improved use of BLAS 3 operations and parallelism 
and, because there is no additional fill-in, load imbalance in a parallel environment 
is less likely to be a problem. However, the factorization need not be stable and 
the factors are of a shifted matrix A + Ds where Ds is a diagonal matrix, and it 
may be necessary to seek to improve the accuracy of the solution using a refinement 
method (see Section 7.4.1). It is also possible that by the time a very small pivot is 
found it is too late to save the stability of the factorization and perturbing the pivot 
effectively just amplifies numerical noise. It is thus essential that static pivoting is
used with care; it makes an $LDL^T$ or LU direct solver less of a “black box solver”
because the guarantees are much weaker than when threshold partial pivoting is
used. A more robust approach can be to incorporate the use of shifts into the
algorithm that calls the linear system solver. For example, a standard technique in
some optimization algorithms that involve symmetric linear systems is to employ
regularization. This can avoid the need for an $LDL^T$ factorization in favour of a
stable Cholesky factorization.
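
A minimal Python sketch of static pivoting on a small dense symmetric matrix is given below. It is illustrative only: the factorization uses 1 x 1 pivots in the natural order, any pivot smaller in absolute value than a user-chosen `tau` is replaced by ±`tau`, and the recorded shifts form the diagonal of $D_s$, so that the computed factors are those of $A + D_s$. The names `ldlt_static`, `ldlt_solve`, `solve_with_refinement` and `tau` are assumptions for the example. A few refinement steps that use the original A in the residual are included to show how the solution of the unperturbed system can be recovered in a benign case.

```python
def ldlt_static(A, tau):
    """LDL^T without delays: pivots with |pivot| < tau are replaced by +/- tau."""
    n = len(A)
    W = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    d, shift = [0.0] * n, [0.0] * n
    for k in range(n):
        piv = W[k][k]
        if abs(piv) < tau:                       # too small: perturb, do not delay
            new = tau if piv >= 0.0 else -tau
            shift[k] = new - piv                 # diagonal entry of D_s
            piv = new
        d[k] = piv
        for i in range(k + 1, n):
            L[i][k] = W[i][k] / piv
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                W[i][j] -= L[i][k] * piv * L[j][k]
    return L, d, shift


def ldlt_solve(L, d, b):
    """Solve L D L^T y = b by forward, diagonal and back substitution."""
    n = len(b)
    y = b[:]
    for i in range(n):
        for j in range(i):
            y[i] -= L[i][j] * y[j]
    y = [yi / di for yi, di in zip(y, d)]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= L[j][i] * y[j]
    return y


def solve_with_refinement(A, b, tau, steps=5):
    L, d, shift = ldlt_static(A, tau)
    x = ldlt_solve(L, d, b)                      # solves (A + D_s) x = b
    first = x[:]
    for _ in range(steps):
        # Residual uses the original (unshifted) matrix A.
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        dx = ldlt_solve(L, d, r)
        x = [xi + di for xi, di in zip(x, dx)]
    return first, x, shift


A = [[0.0, 1.0], [1.0, 1.0]]                     # exact solution of Ax = b is (1, 1)
b = [1.0, 2.0]
first, x, shift = solve_with_refinement(A, b, tau=1e-6)
print(shift, first, x)
```

In this example the zero leading pivot is replaced by 1e-6, the first solve is accurate only to about that level, and refinement removes the effect of the shift. As the text cautions, this is not guaranteed in general.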
Observe that if an $LDL^T$ factorization of a symmetric indefinite matrix A is
computed, then the inertia (that is, the number of positive eigenvalues, negative
eigenvalues and eigenvalues equal to zero) of A can be found by computing the
eigenvalues of the block diagonal factor D. In some applications, computing the
inertia may be desired. For example, in interior point methods for minimizing a
nonlinear objective function subject to constraints, each iteration involves solving
a sparse symmetric indefinite linear system and it is important that the solution
method for this system accurately reports the inertia to allow parameters within the
interior point method to be chosen. One consequence of static pivoting or using a
small threshold γ is that the computed inertia of A is less likely to be accurate.
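
Because the diagonal blocks of D are of order 1 or 2, their eigenvalue signs can be read off without an eigensolver: for a 2 x 2 block the signs follow from its determinant and trace. The following Python sketch counts the inertia in this way. The representation of D as a list of floats and `(a, b, c)` triples, and the name `inertia_from_blocks`, are assumptions made for the illustration; a practical code would also need a scaled tolerance for deciding when a value is zero.

```python
def inertia_from_blocks(blocks, zero_tol=0.0):
    """Return (positive, negative, zero) eigenvalue counts of block diagonal D.

    Each block is a float (1 x 1 pivot) or a tuple (a, b, c) standing for the
    symmetric 2 x 2 pivot [[a, b], [b, c]].
    """
    pos = neg = zero = 0
    for blk in blocks:
        if isinstance(blk, tuple):
            a, b, c = blk
            det, trace = a * c - b * b, a + c
            if det < -zero_tol:                  # eigenvalues of opposite sign
                pos += 1
                neg += 1
            elif det > zero_tol:                 # same sign as the trace
                if trace > 0.0:
                    pos += 2
                else:
                    neg += 2
            else:                                # one zero eigenvalue, other = trace
                zero += 1
                if trace > zero_tol:
                    pos += 1
                elif trace < -zero_tol:
                    neg += 1
                else:
                    zero += 1
        elif blk > zero_tol:
            pos += 1
        elif blk < -zero_tol:
            neg += 1
        else:
            zero += 1
    return pos, neg, zero


# D = diag(4, [[0, 1], [1, 0]], -2, 0)
print(inertia_from_blocks([4.0, (0.0, 1.0, 0.0), -2.0, 0.0]))  # (2, 2, 1)
```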


7.3.4 Special Indefinite Matrices that Avoid Pivoting 


Symmetric saddle point matrices are indefinite matrices of the form

$$A = \begin{pmatrix} G & R^T \\ R & -B \end{pmatrix}, \qquad (7.10)$$

where $G \in \mathbb{R}^{n_1 \times n_1}$ is an SPD matrix, $B \in \mathbb{R}^{n_2 \times n_2}$ is a positive semidefinite matrix
(including B = 0), and $R \in \mathbb{R}^{n_2 \times n_1}$ with $n_1 + n_2 = n$. Such systems include the
class of F matrices, where B = 0 and each column of R has at most two entries and,
if there are two entries, they sum to zero. It is of interest to try to symmetrically
permute A in such a way that the $LDL^T$ factorization of the permuted matrix $PAP^T$
exists without the use of threshold pivoting. This is attractive because it then makes
the factorization as efficient as for an SPD matrix.
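Define the permutation matrix P to be

$$P = [e_1, e_{n_1+1}, e_2, e_{n_1+2}, \ldots, e_{n_2}, e_n, e_{n_2+1}, \ldots, e_{n_1}]^T.$$

Then the permuted matrix $PAP^T$ has a block form in which each entry $A_{i,j}$ is a
2 x 2 or 2 x 1 or 1 x 2 or 1 x 1 block. In particular, the diagonal blocks are

$$A_{i,i} = \begin{cases} \begin{pmatrix} g_{ii} & r_{ii} \\ r_{ii} & -b_{ii} \end{pmatrix}, & 1 \le i \le n_2, \\ g_{ii}, & n_2 + 1 \le i \le n_1. \end{cases}$$

The following theorem shows that a 2 x 2 pivot updated by the Schur complement
of a 1 x 1 pivot is nonsingular and vice versa.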


Theorem 7.5 (Lungten et al. 2018) Let A be the symmetric saddle point
matrix (7.10). Assume $R = (R_1 \; R_2)$ is of full rank with $R_1 \in \mathbb{R}^{n_2 \times n_2}$ nonsingular.
Let $G \in \mathbb{R}^{n_1 \times n_1}$ be SPD and partitioned conformally and let $B \in \mathbb{R}^{n_2 \times n_2}$ be
positive semidefinite. If A is permuted to the form

$$\left(\begin{array}{cc|c} G_{11} & R_1^T & G_{12} \\ R_1 & -B & R_2 \\ \hline G_{12}^T & R_2^T & G_{22} \end{array}\right),$$

then the Schur complement of the symmetric indefinite matrix
$\begin{pmatrix} G_{11} & R_1^T \\ R_1 & -B \end{pmatrix}$ and the
Schur complement of the SPD matrix $G_{22}$ are nonsingular.


A consequence of Theorem 7.5 is that provided R is of full rank and $R_1$ is
nonsingular then the $LDL^T$ factorization of $PAP^T$ exists, with 2 x 2 pivots and 1 x 1
pivots chosen from the diagonal blocks of $PAP^T$ in any order. Assume all the 2 x 2
pivots are selected ahead of the 1 x 1 pivots. If B = 0 and $|r_{ii}| \ge \max_{i < j \le n_1} |r_{ij}|$
($1 \le i \le n_2$), then the growth factor is bounded by $2^{2 n_2}$.

A potential difficulty is that permutation matrices $P_r$ and $P_c$ are needed such that
$P_r R P_c = [R_1 \; R_2]$ with $R_1$ nonsingular. If $P_r$ and $P_c$ can be constructed so that

$$P_r R P_c = \begin{pmatrix} R_{11} & R_{12} \\ & R_{22} \end{pmatrix}, \qquad (7.11)$$

where $R_{11}$ is upper triangular with nonzero diagonal entries then the permuted R is
said to have a trapezoidal form. A simple case where R can be permuted to this
form is if it satisfies the following one-degree principle. Let R be of full rank and
let $G(R) = (V_{row}, V_{col}, E)$ be the bipartite graph of R (Section 6.3.1). R can be
permuted to trapezoidal form if, for $k = 1, 2, \ldots, n_2 - 1$, the bipartite graph of
$R^{(k)}$ has at least one vertex $j_k \in V_{col}$ of degree one, where $R^{(1)} = R$ and $R^{(k+1)}$ is
obtained by removing from $R^{(k)}$ the column vertex $j_k$ and its matched row index $i_k$
together with all edges involving $j_k$ or $i_k$.

To illustrate this, consider the 6 x 8 matrix R in Figure 7.2 and its
associated bipartite graph G(R). The first column vertex with degree one is
2'; it is matched with the row vertex 4. Deleting 2' and 4 removes edges
{(4, 2'), (4, 3'), (4, 5'), (4, 6'), (4, 8')}. Column vertex 3' now has degree one;
it is matched with row vertex 6. Repeating the process gives a perfect matching
M = {(4, 2'), (6, 3'), (1, 4'), (5, 5'), (2, 1'), (3, 6')} together with row and column
matched vertex sets {4, 6, 1, 5, 2, 3} and {2', 3', 4', 5', 1', 6'}, respectively, and
permutation matrices $P_r$ and $P_c$ of order 6 and 8 can be defined to obtain the
trapezoidal form in Figure 7.2.

If after k ≥ 1 steps all columns of the reduced matrix $R^{(k)}$ have degree greater
than 1, the permuted matrix has the form (7.11) where $R_{11}$ is k x k upper triangular,
$R_{12}$ is $k \times (n_1 - k)$ and the $(n_2 - k) \times (n_1 - k)$ block $R_{22}$ has columns of degree
greater than one. $n_1 - k$ steps of Gaussian elimination (with partial pivoting) can be
applied to $R_{22}$ to complete the transformation of R to trapezoidal form.
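
The one-degree principle amounts to a simple peeling process on the bipartite graph, which the following Python sketch illustrates. It assumes that R is supplied as a list of sets, one per row, holding the column indices of its entries (0-based); the function name `one_degree_ordering` and the small 3 x 4 pattern are hypothetical and are not the matrix of Figure 7.2. When several columns have degree one, the lowest-numbered one is taken.

```python
def one_degree_ordering(rows, n_cols):
    """Return (row_order, col_order, matched) using the one-degree principle.

    rows[i] is the set of column indices of the entries in row i of R.
    """
    col_rows = {j: set() for j in range(n_cols)}
    for i, cols in enumerate(rows):
        for j in cols:
            col_rows[j].add(i)
    row_order, col_order = [], []
    remaining_cols = set(range(n_cols))
    for _ in range(len(rows)):
        # First remaining column with exactly one entry in the remaining rows.
        j = next((c for c in sorted(remaining_cols) if len(col_rows[c]) == 1), None)
        if j is None:
            break                                # principle fails at this step
        (i,) = col_rows[j]                       # the matched row vertex
        row_order.append(i)
        col_order.append(j)
        remaining_cols.discard(j)
        for c in rows[i]:                        # delete row i and all its edges
            col_rows[c].discard(i)
    matched = len(row_order)
    col_order += sorted(remaining_cols)          # unmatched columns go last
    row_order += [i for i in range(len(rows)) if i not in row_order]
    return row_order, col_order, matched


rows = [{0, 1, 3}, {1, 2}, {0, 1}]               # sparsity pattern of a 3 x 4 matrix
row_order, col_order, matched = one_degree_ordering(rows, 4)
print(row_order, col_order, matched)             # [1, 0, 2] [2, 3, 0, 1] 3
```

Permuting the rows and columns into the returned orders places the matched entries on the diagonal of an upper triangular leading block, because each chosen column has no other entry in the rows that are still present when it is selected.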
Figure 7.2 Illustration of permuting a full rank matrix to trapezoidal form using the one-degree
principle. The matrix R and its bipartite graph G(R) are given. The edges that belong to the
perfect matching in G(R) found using the one-degree principle are given by the dashed blue
lines; the corresponding matrix entries are in blue. The trapezoidal form comprises a 6 x 6 upper
triangular matrix $R_1$ and a 6 x 2 rectangular matrix $R_2$, where $P_r = [e_4, e_6, e_1, e_5, e_2, e_3]^T$ and
$P_c = [e_2, e_3, e_4, e_5, e_1, e_6, e_7, e_8]$ are the row and column permutation matrices.


7.4 Solving Ill-Conditioned Problems 


Ill-conditioning is connected to the input data: a problem is ill-conditioned if small
changes in the data can lead to large changes in the solution. Assume for the general
linear system Ax = b that A and b are perturbed by ΔA and Δb, respectively,
and the corresponding perturbation of the solution x is Δx, so that the perturbed
problem

$$(A + \Delta A)(x + \Delta x) = b + \Delta b \qquad (7.12)$$

has been solved. The perturbations in A and b may include both data uncertainty
and algorithmic errors. Rearranging (7.12), we obtain

$$A\,\Delta x = \Delta b - \Delta A\,x - \Delta A\,\Delta x.$$

Premultiplying by $A^{-1}$ and considering any norm $\|\cdot\|$ and the corresponding
subordinate matrix norm yields

$$\|\Delta x\| \le \|A^{-1}\| \left( \|\Delta b\| + \|\Delta A\|\,\|x\| + \|\Delta A\|\,\|\Delta x\| \right).$$

It follows that

$$\left( 1 - \|A^{-1}\|\,\|\Delta A\| \right) \|\Delta x\| \le \|A^{-1}\| \left( \|\Delta b\| + \|\Delta A\|\,\|x\| \right)$$

and, provided $\|A^{-1}\|\,\|\Delta A\| < 1$, this gives the following bound on the absolute
error

$$\|\Delta x\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\Delta A\|} \left( \|\Delta b\| + \|\Delta A\|\,\|x\| \right).$$

Dividing by $\|x\|$ and using $\|b\| \le \|A\|\,\|x\|$ yields the relative error bound

$$\frac{\|\Delta x\|}{\|x\|} \le \frac{\kappa(A)}{1 - \kappa(A)\,\|\Delta A\|/\|A\|} \left( \frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|} \right), \qquad (7.13)$$

where

$$\kappa(A) = \|A\|\,\|A^{-1}\| \qquad (7.14)$$

is the condition number of the matrix A. The inequality (7.13) shows that the
condition number is a relative error magnification factor. If we have a stable
algorithm, then a neighbouring problem has been solved, that is,

$$\|\Delta A\|/\|A\| + \|\Delta b\|/\|b\|$$

is small. This ensures an accurate solution if κ(A) is small. A large condition
number means that A is close to being singular (κ(A) tends to infinity as A tends to
singularity).

Observe that the condition number is very dependent on the scaling of A.
Furthermore, κ(A) takes no account of the right-hand side vector b or the fact that
small entries of A (including zeros) may be known within much smaller tolerances
than larger entries.

If the matrix norm is that induced by the Euclidean norm (that is, the 2-norm
$\|\cdot\|_2$) and A is symmetric, then (7.14) becomes

$$\kappa(A) = |\lambda_{max}(A)| / |\lambda_{min}(A)|, \qquad (7.15)$$
ALGORITHM 7.3 Iterative refinement of the computed solution of Ax = b
Input: The vector b and matrix A.
Output: A sequence of approximate solutions x^(0), x^(1), ....

1: Solve Ax^(0) = b                  ▷ x^(0) is the initial computed solution
2: for k = 0, 1, ... do
3:    Compute r^(k) = b − Ax^(k)     ▷ Residual on iteration k
4:    Solve A δx^(k) = r^(k)         ▷ Solve correction equation
5:    x^(k+1) = x^(k) + δx^(k)
6: end for


where $\lambda_{max}(A)$ and $\lambda_{min}(A)$ are eigenvalues of A of largest and smallest absolute
values, respectively. This is called the spectral condition number of A. It is
important when considering convergence of iterative solvers (Section 9.1.2).
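
The role of κ(A) as an error magnification factor in (7.13) can be checked numerically on a tiny example. The Python sketch below uses a nearly singular 2 x 2 matrix and the infinity norm, perturbs only the right-hand side (so ΔA = 0 and the bound reduces to $\kappa(A)\,\|\Delta b\|/\|b\|$), and compares the observed relative change in the solution with the bound. The matrix, the perturbation and the helper names are all chosen for illustration.

```python
def inv2(A):
    """Inverse of a 2 x 2 matrix (explicit formula; illustration only)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def norm_inf(M):
    """Matrix infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


A = [[1.0, 1.0], [1.0, 1.0001]]        # nearly singular
b = [2.0, 2.0001]                      # exact solution is x = (1, 1)
db = [0.0, 1e-6]                       # small perturbation of b

Ainv = inv2(A)
kappa = norm_inf(A) * norm_inf(Ainv)   # condition number (7.14), infinity norm
x = matvec(Ainv, b)
dx = matvec(Ainv, db)                  # change in x caused by db (Delta A = 0)

rel_err = max(abs(v) for v in dx) / max(abs(v) for v in x)
bound = kappa * max(abs(v) for v in db) / max(abs(v) for v in b)
print(kappa, rel_err, bound)
```

Here a relative perturbation of about 5e-7 in b changes the solution by about one per cent, which is within the bound given by a condition number of roughly 4e4.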


7.4.1 Iterative Refinement 


Iterative refinement can be used to overcome matrix ill-conditioning and improve 
the accuracy of the computed solution. It may also be used after relaxed or static 
pivoting. The basic method is outlined as Algorithm 7.3. Note that the solvers in 
Steps 1 and 4 do not have to be the same. The traditional and most common approach
is to use the computed factors of A in both steps. Alternatively, the factors can
be employed as a preconditioner for an iterative solver in Step 4 (preconditioning
and iterative solvers are discussed in Chapter 9). Iterative refinement terminates
when either the norm of the residual vector $r^{(k)}$ is sufficiently close to zero that
the corresponding correction $\delta x^{(k)}$ is very small or the chosen maximum number of
iterations is reached. If there were no roundoff errors in any of the refinement steps, 
the process would converge to the correct solution in a single iteration. In practice, 
the residual generally decreases significantly over the first few iterations before 
stagnating (i.e. reaching a point after which little further accuracy is achieved). If 
the required accuracy has not been achieved, then a possible approach is to switch 
to using the computed factors as a preconditioner for a Krylov subspace solver (see 
Chapter 9). 

Observe that computing $r^{(k)}$ in Step 3 uses the original matrix A and if the
residual is small, a nearby problem will have been solved. This is particularly 
useful when there is uncertainty in the accuracy of the computed factors as an 
approximation to A (for instance, if threshold pivoting or static pivoting has been 
employed). 

There are a number of variants of iterative refinement that involve using different 
precisions for all or part of the process. In traditional iterative refinement, the 
residuals are computed at twice the working precision (the precision at which
the data A, b and the solution x are stored). In fixed precision refinement, all 
computations use the same precision. In mixed precision iterative refinement, the 
most expensive parts of the computation (the LU factorization of A and solving the 
correction equation) are performed in single precision and the residual computation 
in double precision. This is attractive because on modern computer architectures 
single precision arithmetic is usually significantly faster than double precision. 
Moreover, holding the factors in single precision substantially reduces the memory 
required and the amount of data movement. The use of half precision (16-bit) 
arithmetic is also a possibility, assuming it is considerably faster than single 
precision, with a proportional saving in energy consumption. 
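
Mixed precision iterative refinement can be imitated in pure Python by rounding every operation of the factorization and of the triangular solves to IEEE single precision, while the residual and the update of x are carried out in double precision. The sketch below does this for a small dense matrix using LU with partial pivoting. It is an illustration under stated assumptions only: the helper `f32` (which rounds through `struct`), the function names and the test matrix are hypothetical, and a real solver would call single precision library kernels instead.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_single(A):
    """LU factorization with partial pivoting, every operation rounded to single."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    piv = list(range(n))
    for k in range(n):
        m = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[m] = LU[m], LU[k]
        piv[k], piv[m] = piv[m], piv[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, piv


def solve_single(LU, piv, b):
    """Forward and back substitution in (simulated) single precision."""
    n = len(b)
    y = [f32(b[p]) for p in piv]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y


def refine(A, b, steps=5):
    LU, piv = lu_single(A)                      # expensive part: single precision
    x0 = solve_single(LU, piv, b)
    x = x0[:]
    for _ in range(steps):
        # Residual with the original A, in double precision.
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        dx = solve_single(LU, piv, r)           # correction equation: single
        x = [xi + di for xi, di in zip(x, dx)]  # update: double
    return x0, x


A = [[4.1, 1.3, 2.2], [1.3, 3.7, 0.4], [2.2, 0.4, 5.9]]
x_true = [0.1, 0.2, 0.3]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
x0, x = refine(A, b)
print(max(abs(u - v) for u, v in zip(x0, x_true)),
      max(abs(u - v) for u, v in zip(x, x_true)))
```

For this well-conditioned example the initial solution has only single precision accuracy, while a few refinement steps bring the error down to around the level of double precision rounding.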


7.4.2 Scaling to Reduce Ill-Conditioning 


We have discussed the importance of the condition number κ(A). If it is large, then
we would like to reduce it by transforming A. An important way of doing this is by
scaling A before the numerical factorization begins.

Consider two nonsingular n x n diagonal matrices $S_r$ and $S_c$. Diagonal scaling
of the system Ax = b transforms it to

$$S_r A S_c\, y = S_r b, \qquad y = S_c^{-1} x. \qquad (7.16)$$

If A is symmetric, then selecting $S_r = S_c$ retains symmetry. For a general A, scaling
and permuting to bring large entries onto the diagonal can reduce the need for
numerical pivoting, resulting in fewer delayed pivots, less fill-in, faster factorization
and solve times, and a reduction in the storage requirements. But finding a good
scaling can represent a significant overhead (especially within a parallel solver)
and there are limits on the reduction in κ(A) that can be achieved by scaling, as
illustrated by the following result.


Theorem 7.6 (van der Sluis 1969) Let the matrix A be SPD and let $D_A$ be the
diagonal matrix with entries $a_{ii}$ ($1 \le i \le n$). Then for all diagonal matrices D with
positive entries

$$\kappa(D_A^{-1/2} A D_A^{-1/2}) \le nz_{rmax}\, \kappa(D^{-1/2} A D^{-1/2}),$$

where $nz_{rmax}$ is the maximum number of entries in a row of A.


We remark that the original (unscaled) matrix A should be retained for iterative 
refinement of the computed solution. Using the scaled matrix generally results in 
a larger residual for the original system because, in effect, a perturbed system is 
solved. 
Equilibration Scaling


How to find an appropriate scaling is an open question, but a number of heuristics
have been proposed. An obvious choice is to seek to balance entries of the
scaled matrix $S_r A S_c$ to have approximately equal absolute values. This is called
(approximate) equilibration scaling. It is a natural scaling if the numerical values of
the entries of A correspond to physical quantities that are measured using different
scales.

One approach to equilibration scaling that is relatively cheap as well as easy
to implement is to select the diagonal scaling matrices so that the infinity norm
of each row and column of the scaled matrix is approximately equal to unity.
Algorithm 7.4 presents an iterative procedure for computing such a scaling. Observe
that this preserves symmetry. In the nonsymmetric case, Algorithm 7.4 yields the
same results when applied to A and $A^T$ in the sense that the scaled matrix obtained
for $A^T$ is the transpose of that for A.

The infinity norm in Algorithm 7.4 may be replaced by the 1-norm, resulting in
a matrix whose row and column sums are exactly one (this is sometimes called
a doubly stochastic matrix). It can be advantageous to combine the use of the
infinity and one norms, for example, by performing one step of infinity norm scaling
followed by one or more steps of one norm scaling.


ALGORITHM 7.4 Equilibration scaling in the infinity norm
Input: The matrix A and convergence tolerance δ > 0.
Output: Diagonal scaling matrices S_r and S_c.

1: B^(1) = A, D^(1) = I, E^(1) = I
2: for k = 1, 2, ... do
3:    Compute ||B^(k)_{i,1:n}||_∞ and ||B^(k)_{1:n,i}||_∞, 1 ≤ i ≤ n     ▷ i-th row and column of B^(k)
4:    if max_i {|1 − ||B^(k)_{i,1:n}||_∞|} ≤ δ and max_i {|1 − ||B^(k)_{1:n,i}||_∞|} ≤ δ then
         exit for loop
5:    R = diag(sqrt(||B^(k)_{i,1:n}||_∞)) and C = diag(sqrt(||B^(k)_{1:n,i}||_∞))
6:    B^(k+1) = R^{-1} B^(k) C^{-1}, D^(k+1) = D^(k) R^{-1}, E^(k+1) = E^(k) C^{-1}
7: end for
8: S_r = D^(k+1) and S_c = E^(k+1)
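
A direct transcription of this procedure for a small dense matrix might look as follows in Python. It is a sketch only: the function name `equilibrate_inf`, the iteration cap `max_iter` and the safeguard that leaves an all-zero row or column unscaled are assumptions added for the example, and a sparse implementation would loop over stored entries rather than over all (i, j) pairs.

```python
import math


def equilibrate_inf(A, tol=1e-8, max_iter=100):
    """Iteratively scale A so that all row and column infinity norms approach 1.

    Returns the diagonals of S_r and S_c (as lists) and the scaled matrix.
    """
    n = len(A)
    B = [row[:] for row in A]
    sr, sc = [1.0] * n, [1.0] * n
    for _ in range(max_iter):
        rn = [max(abs(v) for v in row) for row in B]                  # row norms
        cn = [max(abs(B[i][j]) for i in range(n)) for j in range(n)]  # column norms
        if max(abs(1.0 - v) for v in rn + cn) <= tol:
            break
        R = [math.sqrt(v) if v > 0.0 else 1.0 for v in rn]
        C = [math.sqrt(v) if v > 0.0 else 1.0 for v in cn]
        for i in range(n):
            for j in range(n):
                B[i][j] /= R[i] * C[j]          # B <- R^{-1} B C^{-1}
            sr[i] /= R[i]
        for j in range(n):
            sc[j] /= C[j]
    return sr, sc, B


# A badly scaled symmetric matrix: the computed S_r and S_c coincide.
A = [[1e6, 2e3, 0.0],
     [2e3, 1.0, 1e-3],
     [0.0, 1e-3, 1e-8]]
sr, sc, B = equilibrate_inf(A)
print(sr, sc)
```

Because each step divides by the square roots of the current norms, the same scaling factors are produced for the rows and columns of a symmetric matrix.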


Matching-Based Scalings 


In Section 6.3.3, we discussed weighted matchings. In particular, the problem of
finding a permutation vector q that maximizes the product

$$\prod_{i=1}^{n} |a_{i q_i}|.$$

The entries $a_{i q_i}$ corresponding to the solution q are the matched entries. The dual
variables $u_i$ and $v_j$ computed by the MC64 algorithm (Algorithm 6.4) that seeks to
compute q can be used to calculate a scaling as follows. Define the diagonal scaling
matrices $S_r$ and $S_c$ to have entries

$$(S_r)_{ii} = \exp(u_i) \quad \text{and} \quad (S_c)_{jj} = \exp(v_j - \log(\max_i |a_{ij}|)), \qquad 1 \le i, j \le n.$$

The entries of the scaled matrix $S_r A S_c$ satisfy

$$|(S_r A S_c)_{ij}| \; \begin{cases} = 1, & \text{if } (i, j) \in M, \\ \le 1, & \text{otherwise}, \end{cases}$$

where M is the maximum weighted matching computed by the MC64 algorithm. If
A is symmetric, let S be the diagonal matrix with entries

$$(S)_{ii} = \sqrt{(S_r)_{ii} (S_c)_{ii}}.$$

Then the symmetric matrix SAS has the same property.


Combining Matching-Based Scalings and Orderings 


The matching-based ordering and scaling can be used independently but they can 
also be combined. After scaling, if the matched entries are non-symmetrically 
permuted onto the diagonal, then because they are large, they provide good pivot 
candidates for an LU factorization. This approach is commonly used alongside 
static pivoting to obtain a factorization of a perturbed matrix, followed by iterative 
refinement to recover the solution to the original system. 

In the symmetric indefinite case, symmetry needs to be maintained and so the
objective is to symmetrically permute a large off-diagonal entry $a_{ij}$ onto the
subdiagonal to give a 2 x 2 block
$\begin{pmatrix} a_{ii} & a_{ij} \\ a_{ij} & a_{jj} \end{pmatrix}$
that is potentially a good 2 x 2 candidate
pivot. Assume that a matching M has been computed using the MC64 algorithm
and let q be the corresponding permutation vector. Any diagonal entries that are
in the matching are immediately considered as potential 1 x 1 pivots and are held
in a set $M_1$. A set $M_2$ of potential 2 x 2 pivots is then built by expressing q in
terms of its component cycles. A cycle of length 1 corresponds to an entry $a_{ii}$ in
the matching. A cycle of length 2 corresponds to two vertices i and j, where $a_{ij}$
and $a_{ji}$ are both in the matching. k potential 2 x 2 pivots can be extracted from
even cycles of length 2k or from odd cycles of length 2k + 1. A straightforward
Figure 7.3 An illustration of a symmetric matching for a symmetric indefinite matrix. On the left
is the matching M returned by the MC64 algorithm and in the centre is a symmetric matching
$M_s$ obtained from M. Entries in the matching are in blue. The pairs (i, j) = (1, 2) and (3, 5) are
possible 2 x 2 pivot candidates. On the right is the compressed matrix that results from combining
rows and columns 1 and 2 and rows and columns 3 and 5.


way to do this is to take the first two entries as the first 2 x 2 pivot, the next two
as the next 2 x 2 pivot, and so on, until if the cycle is of odd length, a single entry
remains, which is added to the set $M_1$. In practice, most cycles in q are of length
1 or 2. A simple example is given in Figure 7.3. Here the matching from MC64 is
M = {(1, 2), (2, 5), (3, 1), (4, 4), (5, 3)}, which is nonsymmetric. q has one cycle
of length 4 (1 → 2 → 5 → 3 → 1) and one of length 1, giving $M_1$ = {(4, 4)} and
$M_2$ = {(1, 2), (2, 1), (3, 5), (5, 3)}.
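
The splitting of q into cycles and the extraction of candidate pivots can be sketched in Python as follows. The function name `symmetric_matching`, the representation of q as a dictionary mapping each row index to its matched column, and the choice to start each cycle at its smallest index are assumptions made for the illustration.

```python
def symmetric_matching(q):
    """Split the cycles of the permutation q into 1 x 1 and 2 x 2 candidates.

    q maps each index i to its matched column q[i]. Returns (m1, m2), where m1
    lists the indices i of potential 1 x 1 pivots and m2 lists the index pairs
    (i, j) of potential 2 x 2 pivots.
    """
    seen, m1, m2 = set(), [], []
    for start in sorted(q):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:                     # walk the cycle containing start
            seen.add(i)
            cycle.append(i)
            i = q[i]
        # Consecutive entries of the cycle are paired up as 2 x 2 candidates.
        m2.extend(zip(cycle[0::2], cycle[1::2]))
        if len(cycle) % 2 == 1:                  # odd cycle: one entry is left over
            m1.append(cycle[-1])
    return m1, m2


# The matching of Figure 7.3: M = {(1, 2), (2, 5), (3, 1), (4, 4), (5, 3)}.
q = {1: 2, 2: 5, 3: 1, 4: 4, 5: 3}
print(symmetric_matching(q))                     # ([4], [(1, 2), (5, 3)])
```

Each pair (i, j) returned in `m2` stands for the two symmetric entries (i, j) and (j, i) of $M_2$.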

Let $M_s = M_1 \cup M_2$ be the resulting symmetric matching obtained from M and
let $Q_s$ be the corresponding permutation matrix. To combine $Q_s$ with a fill-reducing
ordering (such as nested dissection or minimum degree), $Q_s A Q_s^T$ is compressed.
The union of the sparsity structure of the two rows and columns belonging to a
potential 2 x 2 pivot is built and used as the structure of a single row and column
in the compressed matrix. A fill-reducing ordering algorithm is then applied to
the (weighted) compressed graph, and the computed permutation is expanded to
a permutation $Q_f$ for $Q_s A Q_s^T$. The final permutation matrix is the product $Q_f Q_s$.
The rows/columns of a potential 2 x 2 pivot are ordered consecutively.

This approach can reduce the overall computational cost when solving tough 
indefinite systems for which non-matching based orderings require substantial 
modifications to the pivot sequence during the numerical factorization to maintain 
stability. Unfortunately, although after applying the matching-based scaling and 
ordering there are pivot candidates with large entries, there is still no guarantee 
that the computed pivot sequence will not need modifying during the factorization. 
An important disadvantage of using matchings is that the numerical values of the
entries of A are used so that, if a series of matrices with the same sparsity pattern but 
different numerical values need to be factorized (such as occurs when an iterative 
method is used to solve a nonlinear system), the whole symbolic factorization phase 
may have to be rerun for each matrix, potentially adding significantly to the total 
solution time. 
7.5 Notes and References


There are many related but different results on the stability of matrix factorizations. 
While the seminal book of Higham (2002) includes component-wise accuracy and 
stability analysis (see also the classical text (Wilkinson, 1961), which introduced 
the terms partial pivoting and complete pivoting), the norm-wise results given in 
Section 7.1 are based on Demmel (1997); see also Watkins (2002). 

Rook pivoting is introduced in Neal & Poole (1992) and analysed in Foster 
(1997). Early pivoting strategies for dense symmetric indefinite systems are pre- 
sented in Bunch & Parlett (1971), Bunch (1971), and Bunch & Kaufman (1977). 
Static pivoting in sparse LU factorizations was first proposed by Li & Demmel 
(1998). A comprehensive overview of threshold-based pivoting strategies for dense 
and sparse symmetric indefinite problems is given in Ashcraft et al. (1998). This 
includes symmetric rook pivoting for dense problems and a discussion of the 
sparse 2 x 2 threshold partial pivoting strategy of Duff & Reid (1983), which was 
subsequently modified in Duff et al. (1991), and forms the basis of the approach 
of Section 7.3.2. Further implementation details (including incorporating working 
with blocks) are found in Reid & Scott (2011) (see also Hogg & Scott, 2013c). More 
recently, there has been work on new strategies that seek to offer greater potential 
for exploiting parallelism without sacrificing numerical robustness, including Hogg 
& Scott (2014), Hogg et al. (2016), and Duff et al. (2018). 

Avoiding the need to pivot for special classes of indefinite matrices is from 
Lungten et al. (2018) (but see also Tuma, 2002 and de Niet & Wubs, 2009). 
Duff & Pralet (2005) and Schenk & Gärtner (2006) use weighted matchings 
for preprocessing, the latter implementing their strategy within the initial version 
of the solver PARDISO. The HSL mathematical software library (HSL, 2022) 
includes a number of packages that are designed for symmetric indefinite systems, 
most notably the multifrontal codes MA57 (Duff, 2004) and HSL_MA97, and the 
supernodal DAG-based code HSL_MA86 (Hogg & Scott, 2013b). In these solvers, 
the default setting for the threshold pivoting parameter γ is 0.01, although when
used within the well-known interior point solver IPOPT (2022), a value of $10^{-8}$ is
recommended (see also Saunders, 1996). Other well-known sparse direct solvers
that handle symmetric indefinite systems include MUMPS (2022) and WSMP 
(2020). 

The technique of iterative refinement was introduced by Doolittle (1878). It was 
probably first used in a computer program for improving the computed solution to 
a linear system by Wilkinson (1948), during the design and building of the ACE 
computer at the National Physical Laboratory; see also Wilkinson (1963) and Moler 
(1967). The book by Higham (2002) is an essential reference. For sparse systems, 
the paper by Arioli et al. (1989) is of interest. Hogg & Scott (2010) employ iterative 
refinement within a sparse mixed precision multifrontal solver. More recently, with 
a focus on dense systems, Carson & Higham (2017, 2018) and Carson et al. (2020) 
propose an alternative form of mixed precision iterative refinement that is able 
to handle highly ill-conditioned problems by solving for the correction using the 
GMRES iterative method preconditioned by the computed LU factors. The survey
by Abdelfattah et al. (2021) provides a comprehensive review of work on the use of 
mixed precision in numerical linear algebra. 

For Theorem 7.6, we refer to van der Sluis (1969). The equilibration scaling in 
the infinity norm that is outlined in Algorithm 7.4 is given by Ruiz (2001) (see also
Liu, 2015). Matching-based scalings are presented in Duff & Koster (1999, 2001), 
but see also Neumaier & Olschowka (1996) as well as the origins of the scaling 
factors in Edmonds (1965). 
Chapter 8
Sparse Matrix Ordering Algorithms


The computational complexity of obtaining optimal reorderings 
for performing sparse Gaussian elimination justifies the 
heuristic nature of all practical reordering algorithms. — 
Erisman et al. (1987). 


So far, our focus has been on the theoretical and algorithmic principles involved in 
sparse Gaussian elimination-based factorizations. To limit the storage and the work 
involved in the computation of the factors and in their use during the solve phase 
it is generally necessary to reorder (permute) the matrix before the factorization 
commences. The complexity of the most critical steps in the factorization is highly 
dependent on the amount of fill-in, as can be seen from the following observation. 


Observation 8.1 The operations to perform the sparse LU factorization $A = LU$ 
and the sparse Cholesky factorization $A = LL^T$ are 
$O\big(\sum_{j=1}^{n} |\mathrm{col}_L\{j\}|\,|\mathrm{row}_U\{j\}|\big)$ and 
$O\big(\sum_{j=1}^{n} |\mathrm{col}_L\{j\}|^2\big)$ respectively, where 
$|\mathrm{row}_U\{j\}|$ and $|\mathrm{col}_L\{j\}|$ are the 
number of off-diagonal entries in row j of U and column j of L, respectively. 
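
To make the dependence on fill-in concrete, the following minimal sketch evaluates the Cholesky quantity of Observation 8.1 for two orderings of the same pattern. It is an illustration only: the function names, the adjacency-set representation of the graph and the arrow-shaped test pattern are assumptions made here, not part of the text.

```python
def symbolic_cholesky_counts(adj, order):
    """Return the off-diagonal count |col_L{j}| for each elimination step.

    adj maps each vertex to the set of its neighbours (symmetric pattern);
    order is the elimination order. The graph is copied, not modified.
    """
    g = {v: set(nb) for v, nb in adj.items()}
    counts = []
    for v in order:
        nbrs = g.pop(v)
        counts.append(len(nbrs))
        for u in nbrs:
            g[u].discard(v)
            g[u] |= nbrs - {u}  # neighbours of v become a clique (fill-in)
    return counts


def cholesky_op_estimate(counts):
    """Sum of |col_L{j}|^2, the quantity that appears in Observation 8.1."""
    return sum(c * c for c in counts)


# Hypothetical arrow pattern: vertex 0 is coupled to every other vertex.
n = 6
arrow = {0: set(range(1, n))}
for i in range(1, n):
    arrow[i] = {0}

dense_first = symbolic_cholesky_counts(arrow, list(range(n)))
dense_last = symbolic_cholesky_counts(arrow, list(range(1, n)) + [0])
print(dense_first, cholesky_op_estimate(dense_first))
print(dense_last, cholesky_op_estimate(dense_last))
```

Eliminating the dense vertex first fills the whole factor, whereas ordering it last creates no fill-in, so the two estimates differ by an order of magnitude even for this tiny pattern.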


The problem of finding a permutation to minimize fill-in is NP-complete and thus 
heuristics are used to determine orderings that limit the amount of fill-in; we refer 
to these as fill-reducing orderings. Frequently, this is done using the sparsity pattern 
S{A} alone, although sometimes for non-definite matrices, it is combined with the 
numerical factorization because additional permutations of A may be needed to 
make the matrix factorizable. Two main classes of methods that work with S{A} 
are commonly used. 


Local orderings attempt to limit fill-in by repeated local decisions based on G(A) 
(or a relevant quotient graph). 

Global orderings consider the whole sparsity pattern of A and seek to find a 
permutation using a divide-and-conquer approach. Such methods are normally 
used in conjunction with a local fill-reducing ordering, as the latter generally 
works well for problems that are not really large. 

It is assumed throughout this chapter that A is irreducible. Otherwise, if S{A} 
is symmetric, the algorithms are applied to each component of G(A) independently 
and n is then the number of vertices in the component. If S{A} is nonsymmetric, we 
assume that A is in block triangular form and the algorithms are used on the graph 
of each block on the diagonal. We also assume that A has no rows or columns that 
are (almost) dense. If it does, a simple strategy is to remove them before applying 
the ordering algorithm to the remaining matrix; the variables corresponding to the 
dense rows and columns can be appended to the end of the computed ordering to 
give the final ordering. 

Historically, ordering the matrix A before using a direct solver to factorize it was 
generally cheap compared to the numerical factorization cost. However, in the last 
couple of decades, the development of more sophisticated factorization algorithms 
and their implementations in parallel on modern architectures has affected this 
balance so that the ordering can be the most expensive step. If a sequence of 
matrices having the same sparsity pattern is to be factorized, then the ordering 
cost and the cost of the symbolic factorization can be amortized over the numerical 
factorizations. If not, it is important to have available a range of ordering algorithms 
because using a cheap but less effective algorithm may lead to faster complete 
solution times compared to using an expensive approach that gives some savings in 
the memory requirements and operation counts but not enough to offset the ordering 
cost. 


8.1 Local Fill-Reducing Orderings for Symmetric S{A} 


In the symmetric case, the diagonal entries of A are required to be present in S{A} 
(thus zeros on the diagonal are included in the sparsity structure). The aim is to 
limit fill-in in the L factor of an $LL^T$ (or $LDL^T$) factorization of A. Two greedy 
heuristics are the minimum degree (MD) criterion and the local minimum fill (MF) 
criterion. 


8.1.1 Minimum Fill-in (MF) Criterion 


One way to reduce fill-in is to use a local minimum fill-in (MF) criterion that, at 
each step, selects as the next variable in the ordering one that will introduce the least 
fill-in in the factor at that step. This is sometimes called the minimum deficiency 
approach. While MF can produce good orderings, its cost is often considered to be 
prohibitive because it requires the updated sparsity pattern and the fill-in associated 
with the possible candidates must be determined. The runtime can be reduced using 
an approximate variant (AMF) but it is not widely implemented in modern sparse 
direct solvers. 


8.1.2 Basic Minimum Degree (MD) Algorithm 


The minimum degree (MD) algorithm is the best-known and most widely used 
greedy heuristic for limiting fill-in. It seeks to find a permutation such that at each 
step of the factorization the number of entries in the corresponding column of L is 
minimized. This metric is easier and less expensive to compute compared to that 
used by the minimum fill-in criterion. If G(A) is a tree, then the MD algorithm 
results in no fill-in but, in most real applications, it does not minimize the amount 
of fill-in exactly. 

The MD algorithm can be implemented using G(A) and it can predict the 
required factor storage without generating the structure of L. The basic approach 
is given in Algorithm 8.1. At step k, the number of off-diagonal nonzeros in a row 
or column of the active submatrix is the current degree of the corresponding vertex 
in the elimination graph $G^k$. The algorithm selects a vertex of minimum current 
degree in $G^k$ and labels it $v_k$, i.e. next for elimination. The set of vertices adjacent to 
$v_k$ in $G^k$ is $\mathrm{Reach}(v_k, V_k)$, where $V_k$ is the set of $k-1$ vertices that have already 
been eliminated. These are the only vertices whose degrees can change at step k. If 
$u \in \mathrm{Reach}(v_k, V_k)$, $u \ne v_k$, then its updated current degree is $|\mathrm{Reach}(u, V_{k+1})|$, 
where $V_{k+1} = V_k \cup \{v_k\}$. 

At Step 4 of Algorithm 8.1, a tie-breaking strategy is needed when there is more 
than one vertex of current minimum degree. A straightforward strategy is to select 
the vertex that lies first in the original order. For the example in Figure 8.1, vertices 
2, 3, and 6 are initially all of degree 2 and could be selected for elimination; as the 
lowest-numbered vertex, 2 is chosen. After it has been eliminated, vertices 3, 5, and 
6 have current degree 2 and so vertex 3 is next. As all the remaining vertices have 
current degree 2, vertex 1 is eliminated, followed by 4, 5, and 6. It is possible to 
construct artificial matrices showing that some systematic tie-breaking choices can 
lead to a large amount of fill-in but such behaviour is not typical. 


ALGORITHM 8.1 Basic minimum degree (MD) algorithm 
Input: Graph G of a symmetrically structured matrix. 
Output: A permutation vector p that defines a new labelling of the vertices of G. 

1: Set $G^1 = G$ and compute the degree $\deg_{G^1}(u)$ of all $u \in V(G^1)$ 
2: for $k = 1 : n-1$ do 
3:     Compute $mdeg = \min\{\deg_{G^k}(u) \mid u \in V(G^k)\}$    ▷ mdeg is the current minimum degree 
4:     Choose $v_k \in V(G^k)$ such that $\deg_{G^k}(v_k) = mdeg$ 
5:     $p(k) = v_k$    ▷ $v_k$ is the next vertex in the elimination order 
6:     Construct $G^{k+1}$ and update the current degrees of its vertices 
7: end for 
8: $p(n) = v_n$ where $v_n$ is the only vertex in $G^n$ 


Figure 8.1 An illustration of three steps of the MD algorithm. The original graph G and the 
elimination graphs $G^2$, $G^3$ and $G^4$ that result from eliminating vertex 2, then vertex 3 and then 
vertex 1 are shown; red dashed lines denote fill edges. 


The construction of each elimination graph $G^{k+1}$ is central to the implementation 
of the MD algorithm. Because eliminating a vertex potentially creates fill-in, 
an efficient representation of the resulting elimination graph that accommodates 
this (either implicitly or explicitly) is needed. Moreover, recalculating the current 
degrees is time consuming. Consequently, various approaches have been developed 
to enhance performance; these are discussed in the following subsections. 
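
As a point of reference for the refinements that follow, here is a minimal sketch of Algorithm 8.1 that builds each elimination graph explicitly. The function name, the adjacency-set representation and the two small test graphs are assumptions for illustration; ties are broken by the lowest label, as in the strategy described above. It is deliberately naive: every step rescans all current degrees, which is exactly the cost that the later subsections aim to reduce.

```python
def minimum_degree_order(adj):
    """Basic MD ordering using explicit elimination graphs.

    adj maps each vertex to the set of its neighbours (symmetric, no loops).
    Returns (order, fill), where fill is the number of fill edges created.
    """
    g = {v: set(nb) for v, nb in adj.items()}
    order, fill = [], 0
    while g:
        # Vertex of minimum current degree; ties go to the smallest label.
        v = min(g, key=lambda u: (len(g[u]), u))
        nbrs = g.pop(v)
        order.append(v)
        for u in nbrs:
            g[u].discard(v)
        # Make the neighbours of v pairwise adjacent, counting each new edge once.
        for u in nbrs:
            new = nbrs - g[u] - {u}
            fill += len(new)
            for w in new:
                g[u].add(w)
                g[w].add(u)
    return order, fill


# Hypothetical examples: a small tree and a cycle on four vertices.
tree = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2, 5}, 5: {4}}
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
tree_order, tree_fill = minimum_degree_order(tree)
cycle_order, cycle_fill = minimum_degree_order(cycle)
print(tree_order, tree_fill)
print(cycle_order, cycle_fill)
```

The tree is ordered without fill-in, consistent with the remark in Section 8.1.2, while the cycle necessarily incurs one fill edge.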


8.1.3 Use of Indistinguishable Vertices 


In Section 3.5.1, we introduced indistinguishable vertices and supervariables. The 
importance of exploiting these in MD algorithms is emphasized by the next two 
results. Here $G_v$ denotes the elimination graph obtained from G when vertex $v \in V(G)$ 
is eliminated. 


Theorem 8.1 (George & Liu 1980b, 1989) Let u and w be indistinguishable 
vertices in G. If $v \in V(G)$ with $v \ne u, w$, then u and w are indistinguishable in $G_v$. 


Proof Two cases must be considered. First, let $u \notin \mathrm{adj}_G\{v\}$. Then $w \notin \mathrm{adj}_G\{v\}$ and 
if v is eliminated, the adjacency sets of u and w are unchanged and these vertices 
remain indistinguishable in the resulting elimination graph $G_v$. Second, let $u, w \in 
\mathrm{adj}_G\{v\}$. When v is eliminated, because u and w are indistinguishable in G, their 
adjacency sets in $G_v$ will be modified in the same way, by adding the entries of 
$\mathrm{adj}_G\{v\}$ that are not already in $\mathrm{adj}_G\{u\}$ and $\mathrm{adj}_G\{w\}$. Consequently, u and w are 
indistinguishable in $G_v$. □ 


Figure 8.2 demonstrates the two cases in the proof of Theorem 8.1. Here, u and 
w are indistinguishable vertices in G. Setting $v = v'$ illustrates $u \notin \mathrm{adj}_G\{v\}$. If 
$v'$ is eliminated, then the adjacency sets of u and w are clearly unchanged. Setting 
$v = v''$ illustrates $u, w \in \mathrm{adj}_G\{v\}$. In this case, if $v''$ is eliminated, then vertices s 
and t are added to both $\mathrm{adj}_G\{u\}$ and $\mathrm{adj}_G\{w\}$. 


Figure 8.2 An example to illustrate Theorem 8.1. u and w are indistinguishable vertices in G; 
$\mathrm{adj}_G\{u\} = \{r, w, v''\}$ and $\mathrm{adj}_G\{w\} = \{r, u, v''\}$. 


Figure 8.3 An illustration of Theorem 8.2. Vertices u and w are of minimum degree (with degree 
mdeg = 3) and are indistinguishable in G. After elimination of w, the current degree of u is 
mdeg − 1 and the current degree of each of the other vertices is at least mdeg − 1. Therefore, u 
is of current minimum degree in $G_w$. Note that vertices r and v are also of minimum degree and 
indistinguishable in G; they are not neighbours of w and their degrees do not change when w is 
eliminated. 


Theorem 8.2 (George & Liu 1980b, 1989) Let u and w be indistinguishable 
vertices in G. If w is of minimum degree in G, then u is of minimum degree in $G_w$. 


Proof Let $\deg_G(w) = mdeg$. Then $\deg_G(u) = mdeg$. Indistinguishable vertices 
are always neighbours. Eliminating w gives $\deg_{G_w}(u) = mdeg - 1$ because w is 
removed from the adjacency set of u and there is no neighbour of u in $G_w$ that was 
not its neighbour in G. If $x \ne w$ and $x \in \mathrm{adj}_G\{u\}$, then the number of neighbours 
of x in $G_w$ is at least mdeg − 1. Otherwise, if $x \notin \mathrm{adj}_G\{u\}$, then its adjacency set in 
$G_w$ is the same as in G and is of size at least mdeg. The result follows. □ 


Theorem 8.2 is illustrated in Figure 8.3. 

Theorems 8.1 and 8.2 can be extended to more than two indistinguishable 
vertices, which allows indistinguishable vertices to be selected one after another in 
the MD ordering. This is referred to as mass elimination. Treating indistinguishable 
vertices as a single supervariable cuts the number of vertices and edges in the 
elimination graphs, which reduces the work needed for degree updates. 

In the basic MD algorithm, the current degree of a vertex is the number of 
adjacent vertices in the current elimination graph. The external degree of a vertex 
is the number of vertices adjacent to it that are not indistinguishable from it. The 
motivation comes from the underlying reason for the success of the minimum degree 
ordering in terms of fill reduction. Eliminating a vertex of minimum degree implies 
the formation of the smallest possible clique resulting from the elimination. If mass 
elimination is used, then the size of the resulting clique is equal to the external 
degree of the vertices eliminated by the mass elimination step. Using the external 
degree can speed up the time for computing the ordering and give worthwhile 
savings in the number of entries in the factors. 
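
The definitions of indistinguishable vertices and external degree translate directly into set comparisons. The sketch below is an illustration under assumptions: the function names and the five-vertex example graph are hypothetical, and comparing closed neighbourhoods pairwise (or hashing them, as here) is only one simple way to detect supervariables.

```python
from collections import defaultdict


def supervariables(adj):
    """Group vertices whose closed neighbourhoods (adj{v} plus v itself) coincide."""
    groups = defaultdict(list)
    for v in sorted(adj):
        groups[frozenset(adj[v] | {v})].append(v)
    return sorted(groups.values())


def external_degree(adj, v):
    """Number of neighbours of v that are not indistinguishable from v."""
    closed = adj[v] | {v}
    return sum(1 for u in adj[v] if adj[u] | {u} != closed)


# Hypothetical graph: vertices 1 and 2 are adjacent and share the neighbours 3 and 4.
g = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {1, 2}, 5: {3}}
groups = supervariables(g)
print(groups)
print(len(g[1]), external_degree(g, 1))
```

Here vertices 1 and 2 form one supervariable; each has degree 3 but external degree 2, which is the size of the clique created when the pair is removed by mass elimination.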


8.1.4 Degree Outmatching 


A concept that is closely related to that of indistinguishable vertices is degree 
outmatching. This avoids computing the degrees of vertices that are known not 
to be of current minimum degree. Vertex w is said to be outmatched by vertex u if 


$\mathrm{adj}_G\{u\} \cup \{u\} \subseteq \mathrm{adj}_G\{w\} \cup \{w\}$. 


It follows that $\deg_G(u) \le \deg_G(w)$. A simple example is given in Figure 8.4. 
Importantly, degree outmatching is preserved when a vertex $v \in V(G)$ of minimum 
degree is eliminated, as stated in the following result. 


Theorem 8.3 (George & Liu 1980b, 1989) In the graph G let vertex w be 
outmatched by vertex u and vertex v ($v \ne u, w$) be of minimum degree. Then w 
is outmatched in $G_v$ by u. 


Proof Three cases must be considered. First, if $u \notin \mathrm{adj}_G\{v\}$ and $w \notin \mathrm{adj}_G\{v\}$, then 
the adjacency sets of u and w in $G_v$ are the same as in G. Second, if v is a neighbour 
of both u and w in G, then any neighbours of v that were not neighbours of u and 
w are added to their adjacency sets in $G_v$. Third, if $u \notin \mathrm{adj}_G\{v\}$ and $w \in \mathrm{adj}_G\{v\}$, 
then the adjacency set of u in $G_v$ is the same as in G but any neighbours of v that 
were not neighbours of w are added to the adjacency set of w in $G_v$. In all three 
cases, w is still outmatched by u in $G_v$. □ 


Figure 8.4 An example G in which vertex w is outmatched by vertex u. $v'$ is not a neighbour of 
u or w; vertex $v''$ is a neighbour of both u and w; $v'''$ is a neighbour of w but not of u. 


The three possible cases for v in the proof of Theorem 8.3 are illustrated in 
Figure 8.4 by setting $v = v'$, $v''$ and $v'''$, respectively. An important consequence of 
Theorem 8.3 is that if w is outmatched by u, then it is not necessary to consider w 
as a candidate for elimination and all updates to the data structures related to w can 
be postponed until u has been eliminated. 
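
The outmatching test is a subset check on closed neighbourhoods, and Theorem 8.3 can be observed on a small case. In the sketch below the function names and the example graph are assumptions chosen for illustration; the eliminated vertex corresponds to the third case of the proof (a neighbour of w but not of u).

```python
def is_outmatched(adj, w, u):
    """True if w is outmatched by u, i.e. adj{u} + {u} is contained in adj{w} + {w}."""
    return (adj[u] | {u}) <= (adj[w] | {w})


def eliminate(adj, v):
    """Return the elimination graph obtained from adj when v is eliminated."""
    g = {x: set(nb) - {v} for x, nb in adj.items() if x != v}
    nbrs = adj[v]
    for x in nbrs:
        g[x] |= nbrs - {x}  # neighbours of v become pairwise adjacent
    return g


# Hypothetical graph with u = 1 and w = 2; vertex 4 has minimum degree 2.
g = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {2, 5}, 5: {3, 4}}
before = is_outmatched(g, w=2, u=1)
g_v = eliminate(g, 4)
after = is_outmatched(g_v, w=2, u=1)
print(before, after, sorted(g_v[2]))
```

Eliminating vertex 4 adds vertex 5 to the adjacency set of w only, so the containment (and hence the outmatching) survives.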


8.1.5 Cliques and Quotient Graphs 


From Parter’s rule, if vertex v is selected at step k, then the elimination matrix that 
corresponds to $G^{k+1}$ contains a dense submatrix of size equal to the number of off-diagonal 
entries in row and column v in the matrix that corresponds to $G^k$. For large 
matrices, creating and explicitly storing the edges in the sequence of elimination 
graphs is impractical and a more compact and efficient representation is needed. 
Each elimination graph can be interpreted as a collection of cliques, including the 
original graph G, which can be regarded as having $|\mathcal{E}|$ cliques, each consisting of 
two vertices (or, equivalently, an edge). This gives a conceptually different view of 
the elimination process and provides a compact scheme to represent the elimination 
graphs. The advantage in terms of storage is based on the following. 


Let $\{V_1, V_2, \ldots, V_q\}$ be the set of cliques for the current graph and let v 
be a vertex of current minimum degree that is selected for elimination. Let 
$\{V_{s_1}, V_{s_2}, \ldots, V_{s_t}\}$ be the subset of cliques to which v belongs. Two steps are then 
required. 


1. Remove the cliques $\{V_{s_1}, V_{s_2}, \ldots, V_{s_t}\}$ from $\{V_1, V_2, \ldots, V_q\}$. 
2. Add the new clique $V_v = \{V_{s_1} \cup \ldots \cup V_{s_t}\} \setminus \{v\}$ into the set of cliques. 


Hence 


\[ \deg_G(v) = |V_v| < \sum_{i=1}^{t} |V_{s_i}|, \] 


and because $\{V_{s_1}, V_{s_2}, \ldots, V_{s_t}\}$ can now be discarded, the storage required for the 
representation of the sequence of elimination graphs never exceeds that needed 
for G(A). The storage to compute an MD ordering is therefore known beforehand 
in spite of the rather dynamic nature of the elimination process. The index of 
the eliminated vertex can be used as the index of the new clique. This is called 
an element or enode (the terminology comes from finite-element methods), to 
distinguish it from an uneliminated vertex, which is termed an snode. 

A sequence of special quotient graphs $G^{[1]} = G(A), G^{[2]}, \ldots, G^{[n]}$ with the two 
types of vertices is generated in place of the elimination graphs. Each $G^{[k]}$ has n 
vertices that satisfy 


\[ V(G) = V_{snodes} \cup V_{enodes}, \qquad V_{snodes} \cap V_{enodes} = \emptyset, \] 


where $V_{snodes}$ and $V_{enodes}$ are the sets of snodes and enodes, respectively. When v is 
eliminated, any enodes adjacent to it are no longer required to represent the sparsity 
pattern of the corresponding active submatrix and so they can be removed. This is 
called element absorption. 

Working with these graphs can be demonstrated by considering the computation 
of the vertex degrees. To compute the degree of an uneliminated vertex, the set of 
neighbouring snodes is counted. Then, if a neighbour of one of these snodes is an 
enode, its neighbours are also counted (avoiding double counting). More formally, 
if $v \in V_{snodes}$, then the adjacency set of v is the union of its neighbours in $V_{snodes}$ 
and the vertices reachable from v via its neighbours in $V_{enodes}$. In this way, vertex 
degrees are computed by considering fill-paths, avoiding storing the fill-in entries 
explicitly. This reduces memory requirements and contributes to the computational 
efficiency, which can be further improved by amalgamating sets of indistinguishable 
enodes and snodes. 
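
The degree computation just described can be sketched with three dictionaries. This is a simplified illustration, not the data structure of any particular solver: the class name, its attributes and the example graph are assumptions, and refinements such as supervariable amalgamation are omitted.

```python
class QuotientGraph:
    """Quotient graph with snodes (uneliminated) and enodes (eliminated)."""

    def __init__(self, adj):
        self.var_adj = {v: set(nb) for v, nb in adj.items()}  # snode-snode edges
        self.elem_adj = {v: set() for v in adj}               # enodes adjacent to each snode
        self.elem_vars = {}                                   # snodes adjacent to each enode

    def reach(self, v):
        """Adjacency set of snode v in the implied elimination graph."""
        out = set(self.var_adj[v])
        for e in self.elem_adj[v]:
            out |= self.elem_vars[e]
        out.discard(v)
        return out

    def eliminate(self, v):
        """Turn snode v into an enode, absorbing the enodes adjacent to it."""
        members = self.reach(v)
        for e in list(self.elem_adj[v]):
            # Element absorption: the clique of e is covered by the new enode v.
            for u in self.elem_vars.pop(e):
                self.elem_adj[u].discard(e)
        del self.var_adj[v]
        del self.elem_adj[v]
        for u in members:
            self.var_adj[u].discard(v)
            self.var_adj[u] -= members  # these edges are now implied by enode v
            self.elem_adj[u].add(v)
        self.elem_vars[v] = members


# Hypothetical graph: eliminating vertex 1 creates the fill edge (2, 3).
g = {1: {2, 3}, 2: {1, 4}, 3: {1, 4, 5}, 4: {2, 3, 5}, 5: {3, 4}}
q = QuotientGraph(g)
q.eliminate(1)
deg_after_1 = {v: len(q.reach(v)) for v in q.var_adj}
q.eliminate(2)
deg_after_2 = {v: len(q.reach(v)) for v in q.var_adj}
print(deg_after_1, deg_after_2, q.elem_vars)
```

The fill edge between vertices 2 and 3 is never stored; it is recovered through enode 1 when degrees are computed, and enode 1 is absorbed when vertex 2 is eliminated.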

The sequences of elimination graphs and quotient graphs are illustrated in 
Figure 8.5. The top line shows G together with $G^2$ and $G^3$ after the elimination 
of vertices 1 and 2, respectively. When vertex 1 is eliminated, a new edge is 
added to make its neighbours into a clique. The elimination of vertex 2 creates no 
additional fill and the graph $G^3$ with three nodes represents the sparsity structure of 
the corresponding active submatrix $A^{(3)}$. The bottom line shows the corresponding 
quotient graphs. After the first elimination, vertex 1 is an enode and the fill edge 
is represented implicitly. After the second elimination, the enodes 1 and 2 can be 
amalgamated and so too can the snodes 3 and 4 because they are indistinguishable. 


Figure 8.5 The top line shows $G = G^1$, $G^2$ and $G^3$. The red dashed line denotes a fill edge. The 
bottom line shows the quotient graphs $G^{[2]}$ and $G^{[3]}$ after the first and second elimination steps. A 
circle represents a vertex in G (an snode), while a square represents an enode. 


ALGORITHM 8.2 Basic multiple minimum degree (MMD) algorithm 
Input: Graph G of a symmetrically structured matrix. 
Output: A permutation vector p that defines a new labelling of the vertices of G. 

1: Set $k = 1$, $G^1 = G$ and compute the degree $\deg_{G^1}(u)$ of all $u \in V(G^1)$ 
2: while $k < n$ do 
3:     Compute $mdeg = \min\{\deg_{G^k}(u) \mid u \in V(G^k)\}$ 
4:     Find all mutually non-adjacent $v_j \in V(G^k)$, $j = 1, \ldots, t$ with $\deg_{G^k}(v_j) = mdeg$ 
5:     for $j = 1 : t$ do 
6:         $p(k) = v_j$    ▷ Vertex $v_j$ is the next vertex in the elimination order 
7:         $k = k + 1$ 
8:     end for 
9:     if $k < n$ then 
10:        Construct $G^{k+1}$ and update the current degrees of its vertices 
11:    end if 
12: end while 


8.1.6 Multiple Minimum Degree (MMD) Algorithm 


The multiple minimum degree (MMD) algorithm aims to improve efficiency by 
processing several independent vertices that are each of minimum current degree 
together in the same step, before the degree updates are performed. The basic 
approach is outlined as Algorithm 8.2. At each outer loop, $t \ge 1$ denotes the number 
of vertices of minimum current degree that are mutually non-adjacent and so can be 
put into the elimination order one after another. An example in which the four corner 
vertices have the same minimum degree is depicted in Figure 8.6. Here, on the first 
step, mdeg = 2 and t = 4. Note that the MMD strategy is complementary to the 
mass elimination approach in which the set S of indistinguishable vertices that can 
be eliminated one after another is fully interconnected and all vertices of S have the 
same set of neighbours outside S. 
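
Step 4 of Algorithm 8.2 can be sketched as a greedy scan over the vertices of minimum degree. The function names are assumptions, and the 3 × 3 grid used below is a hypothetical stand-in for a graph whose four corner vertices have degree 2; it is not claimed to be the graph of Figure 8.6. Scanning in label order is one possible tie-breaking choice.

```python
def independent_min_degree_vertices(adj):
    """Greedily select mutually non-adjacent vertices of minimum degree."""
    mdeg = min(len(nb) for nb in adj.values())
    chosen = []
    for v in sorted(adj):
        if len(adj[v]) == mdeg and not (adj[v] & set(chosen)):
            chosen.append(v)
    return mdeg, chosen


def grid_graph(rows, cols):
    """Adjacency sets of a rows-by-cols grid with 4-neighbour connectivity."""
    adj = {(r, c): set() for r in range(rows) for c in range(cols)}
    for (r, c) in list(adj):
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].add(nb)
                adj[nb].add((r, c))
    return adj


mdeg, corners = independent_min_degree_vertices(grid_graph(3, 3))
print(mdeg, corners)
```

For the grid, the first outer step finds mdeg = 2 and t = 4, so four vertices enter the ordering before any degree update is performed.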




Figure 8.6 The red (corner) vertices of G are each of degree 2 and are ordered consecutively 
during the first step of Algorithm 8.2. 

The complexity of the MD and MMD algorithms is $O(nz(A)\,n^2)$ but, because the 
outer loop (and hence the degree update) is performed fewer times for MMD, it can be 
significantly faster than MD. MMD orderings can also lead to less fill-in, possibly a 
consequence of introducing some kind of regularity into the ordering sequence. 


8.1.7 Approximate Minimum Degree (AMD) Algorithm 


The idea behind the widely used approximate minimum degree (AMD) algorithm 
is to inexpensively compute an upper bound on a vertex degree in place of the 
degree, and to use this bound as an approximation to the external degree when 
selecting vertices within the MD algorithm. Even though vertex degrees are not 
determined exactly, the quality of the orderings obtained using the AMD algorithm 
is competitive with that of orderings computed using the MD algorithm and can surpass it. 
The complexity of AMD is O(nz(A)n) and, in practice, its runtime is typically 
significantly less than that of the MD and MMD approaches. 


8.2 Minimizing the Bandwidth and Profile 


An alternative way of reducing the fill-in locally is to add another criterion to the 
relabelling of the vertices, such as restricting the nonzeros of the permuted matrix 
to specific positions. The most popular approach is to force them to lie close to the 
main diagonal. If Gaussian elimination is applied without further permutations, then 
all fill-in takes place between the first entry of a row and the diagonal or between 
the first entry of a column and the diagonal. It is therefore sufficient to store all the 
entries in the lower triangular part from the first entry in each row to the diagonal and 
all the entries in the upper triangular part from the first entry in each column to the 
diagonal. This allows straightforward implementations of Gaussian elimination that 
employ static data structures. Here we again consider symmetric S{A} and, for simplicity, 
we assume that G(A) is connected; generalizations of the terminology and ideas to 
nonsymmetric matrices are possible. 


8.2.1 The Band and Envelope of a Matrix 


To characterize the positions within S{A} that are close to the main diagonal, we 
denote the leftmost entries in the lower triangular part of A using the mapping $\eta_i$ as 
follows: 


\[ \eta_i(A) = \min\{\, j \mid 1 \le j \le i \text{ with } a_{ij} \ne 0 \,\}, \qquad 1 \le i \le n, \tag{8.1} \] 

that is, $\eta_i(A)$ is the column index of the first entry in the i-th row of A. 
Define 


\[ \beta_i(A) = i - \eta_i(A), \qquad 1 \le i \le n. \] 

The semibandwidth of A is 

\[ \max\{\beta_i(A) \mid 1 \le i \le n\}, \] 

and the bandwidth is 

\[ \beta(A) = 2 \times \max\{\beta_i(A) \mid 1 \le i \le n\} + 1. \] 

The band of A is the following set of index pairs in A, which lie within the semibandwidth 
of the diagonal: 

\[ \mathrm{band}(A) = \{(i, j) \mid 0 < i - j \le \max\{\beta_k(A) \mid 1 \le k \le n\}\}. \] 


The envelope is the set of index pairs that lie between the first entry in each row and 
the diagonal 


\[ \mathrm{env}(A) = \{(i, j) \mid 0 < i - j \le \beta_i(A)\}. \] 


Note that the band and envelope of a sparse symmetrically structured matrix A 
include only entries of the strict lower triangular part of A. The envelope is easily 
visualized: picture the lower triangular part of A, and remove the diagonal and the 
leading zero entries in each row. The remaining entries (whether nonzero or zero) 
comprise the envelope of A. The profile of A is defined to be the number of entries 
in the envelope (the envelope size) plus n.¹ An illustrative example is given in 
Figure 8.7. Here $\eta_1(A) = 1$, $\beta_1(A) = 0$, $\eta_2(A) = 1$, $\beta_2(A) = 1$, $\eta_3(A) = 2$, 
$\beta_3(A) = 1$, and so on. 




Figure 8.7 Illustration of the band and envelope of a matrix A whose sparsity pattern is on the 
left. In the centre, the positions of band(A) are circled and on the right, the positions of env(A) 
are circled. The bandwidth is 5 and the envelope size and the profile are 7 and 14, respectively. 
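
The quantities defined in this section are simple to compute from the positions of the entries in the strict lower triangle. The following sketch assumes a hypothetical input format (1-based index pairs with i > j, the diagonal taken to be present) and a small made-up pattern; it does not reproduce the matrix of Figure 8.7.

```python
def band_and_envelope(lower_entries, n):
    """Compute eta_i, beta_i and the derived quantities for an n x n pattern.

    lower_entries: iterable of 1-based pairs (i, j) with i > j giving the
    entries of the strict lower triangle; the diagonal is assumed present.
    """
    eta = list(range(1, n + 1))  # eta_i = i when row i has no entry left of the diagonal
    for i, j in lower_entries:
        eta[i - 1] = min(eta[i - 1], j)
    beta = [i - eta[i - 1] for i in range(1, n + 1)]
    semibandwidth = max(beta)
    return {
        "eta": eta,
        "beta": beta,
        "semibandwidth": semibandwidth,
        "bandwidth": 2 * semibandwidth + 1,
        "envelope_size": sum(beta),
        "profile": sum(beta) + n,
    }


# Hypothetical 5 x 5 pattern.
info = band_and_envelope([(2, 1), (3, 2), (4, 1), (5, 4)], n=5)
print(info)
```

The envelope size is the sum of the $\beta_i$, so a single long row (here row 4) dominates the bandwidth while contributing only its own length to the profile.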


¹ Sometimes in the literature the profile is defined to be the envelope size. 




The next result shows that the static data structures determined for A are 
sufficient for its Cholesky factors and by permuting A to minimize its band or 
profile, the fill-in is also approximately minimized. 


Theorem 8.4 (Liu & Sherman 1976; George & Liu 1981) If L is the Cholesky 
factor of A, then 


env(A) = env(L). 


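Proof The proof uses mathematical induction on the principal leading submatrices 
of A of order k. The result is clearly true for k = 1 and k = 2. Assume it holds for 
$2 \le k < n$ and consider the block factorization 


\[ \begin{pmatrix} A_{1:k,1:k} & u_{1:k} \\ u_{1:k}^T & \alpha \end{pmatrix} = \begin{pmatrix} L_{1:k,1:k} & 0 \\ v_{1:k}^T & \beta \end{pmatrix} \begin{pmatrix} L_{1:k,1:k}^T & v_{1:k} \\ 0 & \beta \end{pmatrix}, \] 


where $\alpha$ and $\beta$ are scalars. Equating the left and right sides, $L_{1:k,1:k} v_{1:k} = u_{1:k}$. 
Because $u_j = 0$ for $j < \eta_{k+1}(A)$ and $u_{\eta_{k+1}(A)} \ne 0$, it follows that $v_j = 0$ for 
$j < \eta_{k+1}(A)$ and $v_{\eta_{k+1}(A)} \ne 0$. This proves the induction step. □ 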


A straightforward corollary of Theorem 8.4 is that band(A) = band(L). 
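
Theorem 8.4 can be checked numerically on a small case. The sketch below uses a plain dense Cholesky factorization written for illustration (no pivoting, no sparsity exploited) and a hypothetical diagonally dominant test matrix; the function names and the tolerance are assumptions.

```python
import math


def cholesky(a):
    """Dense Cholesky factor L (lower triangular) of a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def first_nonzero_columns(m, tol=1e-12):
    """0-based column index of the first entry in each row of the lower triangle."""
    return [next(j for j in range(i + 1) if abs(m[i][j]) > tol) for i in range(len(m))]


# Hypothetical 5 x 5 matrix: diagonal 4, a few off-diagonal entries equal to -1.
n = 5
A = [[4.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for i, j in [(1, 0), (2, 1), (3, 0), (4, 3)]:
    A[i][j] = A[j][i] = -1.0

L = cholesky(A)
print(first_nonzero_columns(A), first_nonzero_columns(L))
print(A[3][1], L[3][1])  # a zero inside the envelope of A fills in within L
```

The first entry of every row is in the same column for A and L, while position (4, 2) in 1-based indexing, which lies inside the envelope, becomes nonzero in L.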


8.2.2 Level-Based Orderings 


Finding a permutation P to minimize the band or profile of $PAP^T$ is combinatorially 
hard and again heuristics are used to efficiently find an acceptable P. The 
popular Cuthill-McKee (CM) approach chooses a suitable starting vertex s and 
labels it 1. Then, for $i = 1, 2, \ldots, n-1$, all vertices adjacent to vertex i that are 
still unlabelled are labelled successively in order of increasing degree, as described 
in Algorithm 8.3. A very important variation is the Reverse Cuthill-McKee (RCM) 
algorithm, which incorporates a final step in which the CM ordering is reversed. 
The CM- and RCM-permuted matrices have the same bandwidth but the latter can 
decrease the envelope, as demonstrated in Figure 8.8. 

The importance of the CM and RCM orderings is expressed in the following 
theorem. The full envelope of the Cholesky factor of the permuted matrix implies 
cache efficiency when performing the triangular solves once the factorization is 
complete. 


Theorem 8.5 (Liu & Sherman 1976; George & Liu 1981) Let A be symmetrically 
structured and irreducible. If P corresponds to the CM labelling obtained 
from Algorithm 8.3 and L is the Cholesky factor of $P^TAP$, then env(L) is full, that 
is, all entries of the envelope are nonzero. 


Figure 8.8 An example to illustrate Algorithm 8.3. The starting vertex is s = 3; it has degree 1. 
The graph G(A) is given and the sparsity patterns of A (left), A symmetrically permuted by the 
CM algorithm (centre) and A symmetrically permuted by the RCM algorithm (right). The profiles 
of these matrices are 25, 17, and 16, respectively. 


A crucial difference between profile reduction ordering algorithms and minimum 
degree strategies is that the former is based solely on G: the costly construction of 
quotient graphs is not needed. However, unless the profile after reordering is very 
small, there can be significantly more fill-in in the factor. 

Key to the success of Algorithm 8.3 is the choice of the starting vertex s: the 
quality of the ordering is highly dependent on s. A good candidate is a vertex 
for which the maximum distance between it and some other vertex in G is large. 
Formally, the eccentricity $\epsilon(u)$ of the vertex u in the connected undirected graph G 
is defined to be 


\[ \epsilon(u) = \max\{d(u, v) \mid v \in V\}, \] 


where $d(u, v)$ is the distance between the vertices u and v (the length of the shortest 
path between these vertices). The maximum eccentricity taken over all the vertices 
is the diameter of G (that is, the maximum distance between any pair of vertices). 
The endpoints of a diameter (also termed peripheral vertices) provide good starting 
vertices. The complexity of finding a diameter is $O(n^2)$ because the shortest paths 
amongst all the vertices have to be checked. Thus, a pseudo-diameter defined by 
any pair of vertices for which d(u, v) is close to the diameter is used instead. The 
vertices defining a pseudo-diameter are pseudo-peripheral vertices. 


ALGORITHM 8.3 CM and RCM algorithms for band and profile reduction 
Input: Graph G of a symmetrically structured irreducible matrix and a starting 
vertex s. 
Output: Permutation vectors $p_{cm}$ and $p_{rcm}$ that define new labellings of the vertices 
of G(A). 

1: label(1:n) = false 
2: Compute $\mathrm{adj}_G\{u\}$ and $\deg_G(u)$ for all $u \in V(G)$ 
3: $k = 1$, $v_1 = s$, $p_{cm}(1) = v_1$, label($v_1$) = true 
4: for $i = 1 : n-1$ do 
5:     for $w \in \mathrm{adj}_G\{v_i\}$ with label(w) = false in order of increasing degree do 
6:         $k = k + 1$, $v_k = w$, $p_{cm}(k) = v_k$, label($v_k$) = true 
7:     end for 
8: end for 
9: For the RCM ordering, $p_{rcm}(i) = p_{cm}(n - i + 1)$, $i = 1, 2, \ldots, n$. 
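
Algorithm 8.3 can be sketched in a few lines for a connected graph held as adjacency sets. The function names and the star-shaped example are assumptions made for illustration; ties between vertices of equal degree are broken by label, which is one of several reasonable choices.

```python
def cuthill_mckee(adj, s):
    """Return (p_cm, p_rcm) for a connected graph and starting vertex s."""
    labelled = {s}
    p_cm = [s]
    i = 0
    while i < len(p_cm):
        v = p_cm[i]
        # Unlabelled neighbours of v in order of increasing degree (ties by label).
        for w in sorted(adj[v] - labelled, key=lambda x: (len(adj[x]), x)):
            labelled.add(w)
            p_cm.append(w)
        i += 1
    return p_cm, p_cm[::-1]


def semibandwidth(adj, order):
    pos = {v: k for k, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


def envelope_size(adj, order):
    pos = {v: k for k, v in enumerate(order)}
    return sum(
        max([pos[v] - pos[u] for u in adj[v] if pos[u] < pos[v]], default=0)
        for v in order
    )


# Hypothetical star graph: vertex 0 is adjacent to vertices 1, 2, 3 and 4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
p_cm, p_rcm = cuthill_mckee(star, s=1)
print(p_cm, semibandwidth(star, p_cm), envelope_size(star, p_cm))
print(p_rcm, semibandwidth(star, p_rcm), envelope_size(star, p_rcm))
```

For this graph both orderings give the same semibandwidth, but reversing the ordering reduces the envelope size from 7 to 4, in line with the remark that RCM can decrease the envelope.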


A heuristic algorithm is used to find pseudo-peripheral vertices. A commonly 
used approach is based on level sets. A level structure rooted at a vertex r is defined 
as the partitioning of V into disjoint levels $\mathcal{L}_1(r), \mathcal{L}_2(r), \ldots, \mathcal{L}_h(r)$ such that 


(i) $\mathcal{L}_1(r) = \{r\}$ and 
(ii) for $1 < i \le h$, $\mathcal{L}_i(r)$ is the set of all vertices that are adjacent to vertices in 
$\mathcal{L}_{i-1}(r)$ but are not in $\mathcal{L}_1(r), \mathcal{L}_2(r), \ldots, \mathcal{L}_{i-1}(r)$. 


The level structure rooted at r may be expressed as the set $\mathcal{L}(r) = 
\{\mathcal{L}_1(r), \mathcal{L}_2(r), \ldots, \mathcal{L}_h(r)\}$, where h is the total number of levels and is termed 
the depth. The level sets can be found using a breadth-first search that starts at the 
root r. The Gibbs-Poole-Stockmeyer (GPS) algorithm presented as Algorithm 8.4 
can be used to find pseudo-peripheral vertices, one of which may then be used as 
a starting vertex for the CM and RCM algorithms. Here the root vertex r is normally 
taken to be an arbitrary vertex of minimum degree. $\mathcal{L}(r)$ is constructed, followed by 
the level structures rooted at each of the vertices in the last level set $\mathcal{L}_h(r)$. If, for 
some $w \in \mathcal{L}_h(r)$, the depth of $\mathcal{L}(w)$ exceeds that of $\mathcal{L}(r)$, w replaces r as the root 
vertex, and the procedure is repeated. If no such vertex is found, r is chosen as a 
pseudo-peripheral vertex. 

A simple example is given in Figure 8.9. Starting with r = 2, after two passes 
through the while loop, the GPS algorithm returns s = 8 and t = 1 as pseudo- 
peripheral vertices. 
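
The level structures and the restart strategy can be sketched as follows. The function names are assumptions, and the eight-vertex graph is a hypothetical one constructed here so that its rooted level structures match those quoted for Figure 8.9; it is not claimed to be the graph drawn in the figure. For simplicity the sketch restarts from the first last-level vertex that gives a deeper structure and, on termination, returns the last vertex of the final level set as t.

```python
def level_structure(adj, r):
    """Breadth-first level sets rooted at r, as a list of sorted lists."""
    levels, seen = [[r]], {r}
    while True:
        nxt = sorted({w for v in levels[-1] for w in adj[v]} - seen)
        if not nxt:
            return levels
        seen.update(nxt)
        levels.append(nxt)


def pseudo_peripheral_pair(adj, r):
    """Restart from a last-level vertex with a deeper level structure until none exists."""
    while True:
        levels = level_structure(adj, r)
        last = levels[-1]
        deeper = next(
            (w for w in last if len(level_structure(adj, w)) > len(levels)), None
        )
        if deeper is None:
            return r, last[-1]
        r = deeper


# Hypothetical graph consistent with the level structures quoted for Figure 8.9.
g = {1: {2, 5}, 2: {1, 3}, 3: {2, 4, 7}, 4: {3, 6, 8},
     5: {1, 6}, 6: {4, 5}, 7: {3, 8}, 8: {4, 7}}
print(level_structure(g, 2))
print(level_structure(g, 8))
print(pseudo_peripheral_pair(g, 2))
```

Starting from r = 2 the structure has depth 4; vertex 8 in the last level gives depth 5, and no vertex in its last level does better, so the pair (8, 1) is returned.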

To obtain an efficient implementation of the GPS algorithm, it is necessary to 
limit the number of level set structures that are fully constructed. For example, “short 
circuiting” can be incorporated in which wide level structures are rejected as soon 
as they are detected (wide levels will not lead to a deep level structure, which is 
needed for a narrow band). Furthermore, to reduce the number of vertices in the 
last level set $\mathcal{L}_h(r)$ for which it is necessary to generate the rooted level structures, 
a “shrinking” strategy can be used. This typically involves considering the degrees 
of the vertices in $\mathcal{L}_h(r)$ (for example, only those of smallest degree will be tried). 
Such modifications can lead to significant time savings while still returning a good 
starting vertex for the CM and RCM algorithms. As with the MD algorithm, tie-breaking 
rules must be built into any implementation. 


ALGORITHM 8.4 Basic GPS algorithm to find a pair of pseudo-peripheral vertices 
Input: Graph G of a symmetrically structured irreducible matrix and a root vertex r. 
Output: Pseudo-peripheral vertices s, t. 

1: Construct $\mathcal{L}(r)$ and set flag = false 
2: while flag = false do 
3:     flag = true 
4:     for $i = 1 : |\mathcal{L}_h(r)|$ do 
5:         $w_i \in \mathcal{L}_h(r)$    ▷ Select vertex $w_i$ from last level set 
6:         if flag = true then 
7:             Construct $\mathcal{L}(w_i)$ 
8:             if depth($\mathcal{L}(w_i)$) > depth($\mathcal{L}(r)$) then 
9:                 flag = false    ▷ Flag that $w_i$ will be used as new initial vertex 
10:            end if 
11:        end if 
12:    end for 
13:    if flag = true then 
14:        $s = r$ and $t = w_i$    ▷ s is chosen; while loop will terminate algorithm 
15:    else 
16:        $r = w_i$ 
17:    end if 
18: end while 


Figure 8.9 An example to illustrate Algorithm 8.4 for finding pseudo-peripheral vertices. With 
root vertex r = 2, the first level set structure is $\mathcal{L}(2)$ = {{2}, {1, 3}, {4, 5, 7}, {6, 8}}. Setting r = 8 
at Step 16, the second level set structure is $\mathcal{L}(8)$ = {{8}, {4, 7}, {3, 6}, {2, 5}, {1}} and the algorithm 
terminates with s = 8 and t = 1. 


8.2.3 Spectral Orderings 


Spectral methods offer an alternative approach that does not use level structures. 
The spectral algorithm associates a positive semidefinite Laplacian matrix $L_p$ with 
the symmetric matrix A as follows: 


\[ (L_p)_{ij} = \begin{cases} -1 & \text{if } i \ne j \text{ and } a_{ij} \ne 0, \\ \deg_G(i) & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} \] 


An eigenvector corresponding to the smallest positive eigenvalue of the Laplacian 
matrix is called a Fiedler vector. If G is connected, $L_p$ is irreducible and the second 
smallest eigenvalue is positive. The vertices of G are ordered by sorting the entries 
of the Fiedler vector into monotonic order. Applying the permutation symmetrically 
to A yields the spectral ordering. 
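
A minimal NumPy sketch of the spectral ordering is given below. It is illustrative only and
rests on several assumptions: the graph is connected, the matrix is small enough for a dense
symmetric eigensolver (`numpy.linalg.eigh`) to be acceptable, and the function name
`spectral_ordering` is hypothetical. Practical codes would use a sparse eigensolver instead.

```python
import numpy as np


def spectral_ordering(A):
    """Order the vertices of G(A) by sorting the entries of a Fiedler vector."""
    A = np.asarray(A)
    pattern = A != 0
    np.fill_diagonal(pattern, False)            # the graph has no self-loops
    pattern = pattern | pattern.T               # work with a symmetric pattern
    laplacian = np.diag(pattern.sum(axis=1)) - pattern.astype(float)
    # eigh returns eigenvalues in ascending order; for a connected graph the
    # first is zero and the second is the smallest positive eigenvalue.
    _, vecs = np.linalg.eigh(laplacian)
    fiedler = vecs[:, 1]
    return np.argsort(fiedler, kind="stable")


# A path graph whose vertices carry scrambled labels: 2 - 0 - 4 - 1 - 3.
A = np.eye(5)
for i, j in [(2, 0), (0, 4), (4, 1), (1, 3)]:
    A[i, j] = A[j, i] = 1.0
order = spectral_ordering(A).tolist()
A_perm = A[np.ix_(order, order)]                # symmetric permutation of A
```

For this example the Fiedler vector is monotonic along the path, so the ordering recovers the
path (in one direction or the other) and the permuted matrix is tridiagonal.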

The use of the Fiedler vector for reordering A comes from considering the matrix
envelope. The size of the envelope can be written as

$$
|\mathrm{env}(A)| = \sum_{i=1}^{n} \beta_i
  = \sum_{i=1}^{n} \max_{\substack{k<i \\ (k,i)\in\mathcal{G}}} (i - k).
$$

Observation 8.1 implies that the asymptotic upper bound on the operation count for
the factorization based on env(A) is

$$
\mathrm{work}_{env} = \sum_{i=1}^{n} \beta_i^2
  = \sum_{i=1}^{n} \max_{\substack{k<i \\ (k,i)\in\mathcal{G}}} (i - k)^2.
$$

Ordering the vertices using the Fiedler vector is closely related to minimizing
$\mathrm{weight}_{env}$ over all possible vertex reorderings, where

$$
\mathrm{weight}_{env} = \sum_{i=1}^{n} \sum_{\substack{k<i \\ (k,i)\in\mathcal{G}}} (i - k)^2.
$$

Thus, while minimizing the profile and envelope is related to the infinity norm,
minimizing $\mathrm{weight}_{env}$ is related to the Euclidean norm of the distance between
graph vertices.
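
The three quantities can be compared on a small example. The sketch below is illustrative
only; it assumes 1-based vertex labels, that each edge of the graph is listed once, and the
function name `envelope_metrics` is hypothetical.

```python
def envelope_metrics(edges, n):
    """Return (|env(A)|, work_env, weight_env) for a graph on vertices 1..n."""
    beta = [0] * (n + 1)          # beta[i] = max over neighbours k < i of (i - k)
    weight = 0
    for k, i in edges:
        if k == i:
            continue              # diagonal entries do not contribute
        if k > i:
            k, i = i, k           # make sure that k < i
        beta[i] = max(beta[i], i - k)
        weight += (i - k) ** 2
    env = sum(beta)
    work = sum(b * b for b in beta)
    return env, work, weight


# A path on four vertices with its natural labelling and with a poor labelling.
natural = envelope_metrics([(1, 2), (2, 3), (3, 4)], 4)
poor = envelope_metrics([(1, 4), (4, 2), (2, 3)], 4)
```

Under these assumptions the natural labelling gives (3, 3, 3) and the poor labelling gives
(4, 10, 14), so all three measures penalize the edge that joins vertices 1 and 4.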

Although computing the Fiedler vector can be computationally expensive, it
does have the advantage of easy vectorization and parallelization, and the resulting
ordering can give small profiles and low operation counts.


8.3 Local Fill-Reducing Orderings for Nonsymmetric S{A}


If S{A} is nonsymmetric, then an often-used strategy is to apply the minimum
degree algorithm (or one of its variants) or a band or profile-reducing ordering to the
undirected graph $\mathcal{G}(A + A^T)$. This can work well if the symmetry index s(A) is close
to 1. But if A is highly nonsymmetric (typically, for values of s(A) less than 0.5,
A is considered to be highly nonsymmetric), then a different approach is required.
Markowitz pivoting generalizes the MD algorithm by choosing the pivot entry
based on vertex degrees computed directly from the nonsymmetric S{A}; the result
is a nonsymmetric permutation. It can be described using a sequence of bipartite
graphs of the active submatrices but here we use a matrix-based description that
permutes A on the fly. Note that Markowitz pivoting is generally incorporated into
the numerical factorization phase of an LU solver, rather than being used to derive
an initial reordering of A.

At step k of the LU factorization, consider the (n − k + 1) × (n − k + 1) active
submatrix, that is, the Schur complement $S^{(k)}$ given by (3.2). Let $nz(row_i)$ and
$nz(col_j)$ denote the number of entries in row i and column j of $S^{(k)}$
($1 \le i, j \le n - k + 1$). Markowitz pivoting selects as the k-th pivot the entry of
$S^{(k)}$ that minimizes the Markowitz count given by the product

$$
(nz(row_i) - 1)(nz(col_j) - 1).
$$

This strategy is summarized in Algorithm 8.5 and illustrated in Figure 8.10. Here
the first pivot is $a_{24}$ with Markowitz count 1; it does not cause fill-in. The second
pivot has Markowitz count 2 in $S^{(2)}$; it results in one filled entry. Note that the
interchanges of rows and columns that are potentially performed at each of the first
n − 1 steps of the factorization give the row and column permutation matrices on
the output of Algorithm 8.5. Implementation of the algorithm requires access to the
rows and the columns of the matrix.


ALGORITHM 8.5 Markowitz pivoting
Input: Matrix A with a nonsymmetric sparsity pattern.
Output: A′ = PAQ, where P and Q are permutation matrices chosen to limit fill-in.

1: Set $S^{(1)} = A$ and A′ = A
2: for k = 1 : n − 1 do
3:     Compute $nz(row_i)$ and $nz(col_j)$ ($1 \le i, j \le n - k + 1$)
4:     Find an entry $a^{(k)}_{ij}$ of $S^{(k)}$ that minimizes $(nz(row_i) - 1)(nz(col_j) - 1)$
5:     Permute the rows and columns so that $a^{(k)}_{ij}$ is the (1, 1) entry of the permuted $S^{(k)}$
6:     Compute Schur complement $S^{(k+1)}$ of the permuted $S^{(k)}$ with respect to its (1, 1) entry
7: end for


Figure 8.10 Illustration of Markowitz pivoting. The first and second pivots are circled. The
sparsity pattern of $A = S^{(1)}$ is on the left. In the centre is the sparsity pattern after
permuting the pivot in position (2, 4) to the (1, 1) position of $S^{(1)}$. There is no fill-in
after the first factorization step. On the right is the sparsity pattern after selecting the
second pivot that has the original position (4, 2) and permuting it to the (1, 1) position of
$S^{(2)}$. The resulting filled entry is denoted by f. Note that the nonsymmetric permutations
transform the originally irreducible matrix into a reducible one.


Markowitz pivoting as described here only considers the sparsity of A and the 
subsequent Schur complements. In practice, the pivoting strategy also needs to avoid 
small pivots because, as discussed in the last chapter, they can lead to numerical 
instability. A simple improvement is to break ties in Step 4 by choosing from the 
entries with the minimum Markowitz count the one of largest absolute value. 
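
The pivot search with this tie-breaking rule can be sketched as follows. The sketch is
illustrative only: it works on a small dense array rather than a sparse data structure, it
recomputes all row and column counts from scratch, and the function name `markowitz_pivot` is
hypothetical. It does not include a threshold test on the pivot size.

```python
import numpy as np


def markowitz_pivot(S):
    """Return (i, j, count) for an entry of S with minimum Markowitz count.

    Ties are broken by choosing the candidate of largest absolute value.
    """
    S = np.asarray(S, dtype=float)
    nz = S != 0
    row_counts = nz.sum(axis=1)                 # nz(row_i)
    col_counts = nz.sum(axis=0)                 # nz(col_j)
    best_key, best_pos = None, None
    for i, j in zip(*np.nonzero(nz)):
        count = (row_counts[i] - 1) * (col_counts[j] - 1)
        key = (count, -abs(S[i, j]))            # smaller count, then larger |entry|
        if best_key is None or key < best_key:
            best_key, best_pos = key, (int(i), int(j))
    if best_pos is None:
        raise ValueError("matrix has no nonzero entries")
    return best_pos[0], best_pos[1], int(best_key[0])


S1 = np.array([[4.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 0.0, 2.0]])
pivot = markowitz_pivot(S1)                     # entry (2, 2) has Markowitz count 0
tie = markowitz_pivot(np.array([[1.0, 0.0],
                                [0.0, -5.0]]))  # both counts are 0; the larger entry wins
```

Indices in the sketch are 0-based. In the first example the only entry in the last row is
selected because eliminating it cannot create fill-in.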

Because computing row and column counts is expensive, practical implemen- 
tations may restrict computing them to a limited number of rows and columns. 
Alternatively, the search may be restricted to a predetermined number of rows 
of lowest row count (typically two or three rows), choosing entries with best 
Markowitz count and breaking ties on numerical grounds. Another option is 
to restrict the pivot choice to diagonal entries, in which case A is permuted 
symmetrically. 

Algorithm 8.5 needs storage formats that can accommodate dynamic changes
to the Schur complements. One example is the DS format described in Section 1.3.2,
which allows access to both the rows and the columns. However, this format is only
feasible if the amount of fill-in during the factorization is not large.


8.4 Global Nested Dissection Orderings 


Nested dissection is the most important and widely used global ordering strategy
for direct methods when S{A} is symmetric; it is particularly effective for ordering
very large matrices. It proceeds by identifying a small set of vertices $\mathcal{V}_S$ (known as
a vertex separator) that, if removed, separates the graph into two disjoint subgraphs
described by the vertex subsets $\mathcal{B}$ and $\mathcal{W}$ (commonly called “black” and “white”,
respectively). The rows and columns belonging to $\mathcal{B}$ are labelled first, then those
belonging to $\mathcal{W}$ and finally those in $\mathcal{V}_S$. The reordered matrix has the form

$$
\begin{pmatrix}
A_{\mathcal{B},\mathcal{B}} & 0 & A_{\mathcal{B},\mathcal{V}_S} \\
0 & A_{\mathcal{W},\mathcal{W}} & A_{\mathcal{W},\mathcal{V}_S} \\
A_{\mathcal{B},\mathcal{V}_S}^T & A_{\mathcal{W},\mathcal{V}_S}^T & A_{\mathcal{V}_S,\mathcal{V}_S}
\end{pmatrix}.
\qquad (8.2)
$$

Figure 8.11 A simple example to illustrate nested dissection. The pattern of the original
matrix (top), the partitioned graph (centre), and the corresponding symmetrically permuted matrix
(bottom) are given.


This is shown for a 13 × 13 example in Figure 8.11. Provided the variables
are eliminated in the permuted order, no fill occurs within the zero off-diagonal
blocks. If $|\mathcal{V}_S|$ is small and $|\mathcal{B}|$ and $|\mathcal{W}|$ are similar, these zero blocks account
for approximately half the possible entries in the matrix. The reordering can be
applied recursively to the submatrices $A_{\mathcal{B},\mathcal{B}}$ and $A_{\mathcal{W},\mathcal{W}}$ until the vertex subsets

ALGORITHM 8.6 Nested dissection algorithm

Input: Graph G of a symmetrically structured matrix A and a partitioning algorithm
PartitionAlg.

Output: A permutation vector p that defines a new labelling of the vertices of G.

1: recursive function (p = nested_dissection(A, PartitionAlg))
2:     if dissection has terminated then    ▷ Vertex subsets are smaller than some threshold
3:         p = AMD($\mathcal{V}$, $\mathcal{E}$)    ▷ Compute an AMD ordering
4:     else
5:         Use PartitionAlg($\mathcal{V}$, $\mathcal{E}$) to obtain the vertex partitioning ($\mathcal{B}$, $\mathcal{W}$, $\mathcal{V}_S$)
6:         $p_{\mathcal{B}}$ = nested_dissection($A_{\mathcal{B},\mathcal{B}}$, PartitionAlg)
7:         $p_{\mathcal{W}}$ = nested_dissection($A_{\mathcal{W},\mathcal{W}}$, PartitionAlg)
8:         $p_{\mathcal{V}_S}$ is an ordering of $\mathcal{V}_S$
9:         Set $p = \begin{pmatrix} p_{\mathcal{B}} \\ p_{\mathcal{W}} \\ p_{\mathcal{V}_S} \end{pmatrix}$
10:    end if
11: end recursive function


are of size less than some prescribed threshold. At this stage, a local ordering
technique (such as AMD) is normally more effective than nested dissection, and so a
switch is made. The general form of the nested dissection algorithm is summarized
in Algorithm 8.6. The parameter PartitionAlg specifies the algorithm used in
determining the partitioning of the vertices. The performance and efficacy of nested
dissection are highly dependent on the choice of PartitionAlg. Originally, level set based
methods were used, but most current approaches use multilevel techniques that create a
hierarchy of graphs, each representing the original graph but with a smaller dimension. The
smallest (that is, the coarsest) graph in the sequence is partitioned. This partition
is propagated back through the sequence of graphs, while being periodically
refined.
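
The recursion of Algorithm 8.6 can be sketched in Python as follows. This is an illustrative
sketch under stated assumptions, not a practical implementation: the graph is held as a
dictionary of adjacency lists, the partitioning routine `level_set_partition` is a simple
level set based stand-in (the middle level of a breadth-first search is taken as the
separator), and the small subsets are ordered by degree as a placeholder for AMD. All
function names are hypothetical.

```python
def bfs_levels(adj, vertices, root):
    """Level sets of a breadth-first search from root, restricted to `vertices`."""
    inside, seen, levels = set(vertices), {root}, [[root]]
    while True:
        nxt = []
        for u in levels[-1]:
            for v in adj[u]:
                if v in inside and v not in seen:
                    seen.add(v)
                    nxt.append(v)
        if not nxt:
            return levels
        levels.append(nxt)


def level_set_partition(adj, vertices):
    """Return (B, W, V_S): the middle BFS level is used as the vertex separator."""
    levels = bfs_levels(adj, vertices, min(vertices))
    mid = len(levels) // 2
    reached = {v for lev in levels for v in lev}
    black = [v for lev in levels[:mid] for v in lev]
    white = [v for lev in levels[mid + 1:] for v in lev]
    white += [v for v in vertices if v not in reached]   # other components, if any
    return black, white, list(levels[mid])


def nested_dissection(adj, vertices, partition_alg, threshold=2):
    """Return the new labelling order: B first, then W, then the separator."""
    vertices = list(vertices)
    if len(vertices) <= threshold:
        # Placeholder for AMD: order the few remaining vertices by degree.
        return sorted(vertices, key=lambda v: (len(adj[v]), v))
    black, white, separator = partition_alg(adj, vertices)
    p_black = nested_dissection(adj, black, partition_alg, threshold)
    p_white = nested_dissection(adj, white, partition_alg, threshold)
    return p_black + p_white + separator


# Path graph 0 - 1 - ... - 6: vertex 3 is the top-level separator.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
perm = nested_dissection(adj, range(7), level_set_partition)
```

Because the separator is never empty, both subsets passed to the recursive calls are strictly
smaller than the current one and the recursion terminates.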


8.5 Bordered Forms 


Another possibility to exploit the global matrix structure is to use bordered block 
forms. These forms can arise naturally in some practical applications.


8.5.1 Doubly Bordered Form


The matrix (8.2) is an example of a doubly bordered block diagonal (DBBD)
form. More generally, a matrix is said to be in DBBD form if it has the block structure

$$
A_{DB} =
\begin{pmatrix}
A_{1,1} & & & & C_1 \\
 & A_{2,2} & & & C_2 \\
 & & \ddots & & \vdots \\
 & & & A_{Nb,Nb} & C_{Nb} \\
R_1 & R_2 & \cdots & R_{Nb} & B
\end{pmatrix},
\qquad (8.3)
$$

where Nb > 1, the blocks $A_{lb,lb}$ on the diagonal are square $n_{lb} \times n_{lb}$ matrices
and the border blocks $C_{lb}$ and $R_{lb}$ are $n_{lb} \times n_S$ and $n_S \times n_{lb}$ matrices,
respectively, with $n_S \ll n_{lb}$ ($1 \le lb \le Nb$). B is an $n_S \times n_S$ matrix. The blocks
can have very different sizes. A nested dissection ordering can be used to permute a
symmetrically structured matrix A to a symmetrically structured DBBD form
($\mathcal{S}\{R_{lb}\} = \mathcal{S}\{C_{lb}^T\}$). If S{A} is close to symmetric, then nested dissection
can be applied to $\mathcal{S}\{A + A^T\}$. In finite-element applications, the DBBD form
corresponds to partitioning the underlying finite-element domain into non-overlapping
subdomains; each $A_{lb,lb}$ represents the interior of a subdomain and the variables in the
borders are those that lie on an interface between two or more subdomains.

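Coarse-grained parallel approaches aim to factorize the $A_{lb,lb}$ blocks in parallel
before solving the interface problem that connects the blocks. The block factorization
of $A_{DB}$ is

$$
A_{DB} =
\begin{pmatrix}
L_1 & & & & \\
 & L_2 & & & \\
 & & \ddots & & \\
 & & & L_{Nb} & \\
\widehat{R}_1 & \widehat{R}_2 & \cdots & \widehat{R}_{Nb} & L_S
\end{pmatrix}
\begin{pmatrix}
U_1 & & & & \widehat{C}_1 \\
 & U_2 & & & \widehat{C}_2 \\
 & & \ddots & & \vdots \\
 & & & U_{Nb} & \widehat{C}_{Nb} \\
 & & & & U_S
\end{pmatrix},
$$

where

$$
\widehat{R}_{lb} = R_{lb} U_{lb}^{-1}, \quad
\widehat{C}_{lb} = L_{lb}^{-1} C_{lb} \quad (1 \le lb \le Nb), \qquad
L_S U_S = B - \sum_{lb=1}^{Nb} \widehat{R}_{lb} \widehat{C}_{lb}.
$$

The process is summarized in Algorithm 8.7. Here, for simplicity of notation, the
permutation matrices for the block factorizations are set to the identity; in practice,
$A_{lb,lb} = P_{lb} L_{lb} U_{lb} Q_{lb}$ for some permutation matrices $P_{lb}$ and $Q_{lb}$
($1 \le lb \le Nb$) and $S = P_S L_S U_S Q_S$ for some permutation matrices $P_S$ and $Q_S$.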

There are several opportunities to incorporate parallelism. First, the factorizations
of the blocks $A_{lb,lb}$ on the diagonal are completely independent. In addition,

ALGORITHM 8.7 Coarse-grained parallel LU factorization using DBBD form
Input: Matrix $A_{DB}$ in DBBD form (8.3).
Output: Block LU factorization.

1: Initialise S = B
2: for lb = 1 : Nb do
3:     $A_{lb,lb} = L_{lb} U_{lb}$    ▷ LU factorization of square block on diagonal
4:     $\widehat{R}_{lb} = R_{lb} U_{lb}^{-1}$    ▷ Triangular solve for bottom-border blocks
5:     $\widehat{C}_{lb} = L_{lb}^{-1} C_{lb}$    ▷ Triangular solve for right-border blocks
6: end for
7: $S = S - \sum_{lb=1}^{Nb} \widehat{R}_{lb} \widehat{C}_{lb}$    ▷ Assemble updates to interface block
8: $S = L_S U_S$    ▷ Factorize updated interface block (Schur complement)


the factorization of each individual $A_{lb,lb}$ can be parallelized. The same is true
for the triangular solves that update the border blocks. Second, the assembly of
the interface block S can be partially parallelized (it can be started as soon as the
first updated border blocks are available). Third, the LU factorization of S can be
parallelized.
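
The structure of the computation can be illustrated with a small dense NumPy sketch. It is
a sketch only and makes simplifying assumptions: dense solves with `numpy.linalg.solve` stand
in for the sparse LU factorizations and triangular solves, so the update
$\widehat{R}_{lb} \widehat{C}_{lb}$ is formed as $R_{lb} A_{lb,lb}^{-1} C_{lb}$, the loop over the
blocks is executed serially, and the function name `dbbd_solve` is hypothetical.

```python
import numpy as np


def dbbd_solve(A_blocks, C_blocks, R_blocks, B, b_blocks, b_s):
    """Solve a DBBD system via the interface (Schur complement) problem."""
    S = B.astype(float).copy()
    z = b_s.astype(float).copy()
    W, Y = [], []
    for A, C, R, b in zip(A_blocks, C_blocks, R_blocks, b_blocks):
        W_lb = np.linalg.solve(A, C)       # A_lb^{-1} C_lb (independent per block)
        y_lb = np.linalg.solve(A, b)       # A_lb^{-1} b_lb
        S -= R @ W_lb                      # assemble update to the interface block
        z -= R @ y_lb
        W.append(W_lb)
        Y.append(y_lb)
    x_s = np.linalg.solve(S, z)            # interface problem
    x_blocks = [y - W_lb @ x_s for y, W_lb in zip(Y, W)]
    return x_blocks, x_s


# Small symmetric, strictly diagonally dominant example with Nb = 2 and n_S = 2.
A1 = np.array([[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]])
A2 = np.array([[3.0, 1.0], [1.0, 3.0]])
C1 = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
C2 = np.array([[0.0, 1.0], [1.0, 0.0]])
B = np.array([[5.0, 1.0], [1.0, 5.0]])
full = np.block([[A1, np.zeros((3, 2)), C1],
                 [np.zeros((2, 3)), A2, C2],
                 [C1.T, C2.T, B]])
rhs = np.ones(7)
x_blocks, x_s = dbbd_solve([A1, A2], [C1, C2], [C1.T, C2.T], B,
                           [rhs[:3], rhs[3:5]], rhs[5:])
x = np.concatenate(x_blocks + [x_s])
```

Each pass through the loop touches only one diagonal block and its borders, which is where
the coarse-grained parallelism comes from; only the updates to S and z need to be combined.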

Observe that S is generally significantly denser than the other blocks and can 
present a computational bottleneck. In fact, not only is factorizing S expensive in
terms of the memory and operations required, but assembling the updates to it can also be
time consuming. This is because multiple submatrices may contribute to the same entry
of S, and these updates cannot be performed at the same time. Furthermore, for an efficient
parallel implementation, load balance must be considered. If the work required for 
factorizing each of the blocks on the diagonal is not similar, then the time will be 
dominated by the most expensive block. One possible solution is to choose Nb to be 
greater than the number of processors and use dynamic scheduling to achieve good 
load balance. Unfortunately, if the number of blocks increases, so too does the size 
of S. 

If A is not SPD, then factorizing the $A_{lb,lb}$ blocks without considering the entries
in the border can potentially lead to stability problems. Consider the first step in
factorizing $A_{lb,lb}$ and the threshold pivoting test (7.5) for a sparse LU factorization.
The pivot candidate $(A_{lb,lb})_{11}$ must satisfy

$$
\max\Big\{ \max_{i>1} |(A_{lb,lb})_{i1}|, \; \max_{k} |(R_{lb})_{k1}| \Big\}
  \le \gamma^{-1} |(A_{lb,lb})_{11}|,
$$

where $\gamma \in (0, 1]$ is the threshold parameter. Large entries in the row border matrix
$R_{lb}$ can prevent pivots being selected within $A_{lb,lb}$. Stability can be maintained by
moving rows and columns that cannot be eliminated to the borders. This increases
the border size and may adversely affect the a priori sparse data structures for
holding the factors, increase the work required to perform the factorization, and
reduce the potential for parallelism within the factorization of the block.


8.5.2 Singly Bordered Form


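An alternative strategy is to permute A to singly bordered block diagonal (SBBD)
form

$$
A_{SB} =
\begin{pmatrix}
A_{1,1} & & & & C_1 \\
 & A_{2,2} & & & C_2 \\
 & & \ddots & & \vdots \\
 & & & A_{Nb,Nb} & C_{Nb}
\end{pmatrix},
$$

where the blocks $A_{lb,lb}$ are rectangular $m_{lb} \times n_{lb}$ matrices with $m_{lb} \ge n_{lb}$ and
$\sum_{lb=1}^{Nb} m_{lb} = n$, and the border blocks $C_{lb}$ are of order $m_{lb} \times n_I$
($n_I \ll n_{lb}$), where $n_I = \sum_{lb=1}^{Nb} (m_{lb} - n_{lb})$. The linear system becomes

$$
\begin{pmatrix}
A_{1,1} & & & & C_1 \\
 & A_{2,2} & & & C_2 \\
 & & \ddots & & \vdots \\
 & & & A_{Nb,Nb} & C_{Nb}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{Nb} \\ x_I \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{Nb} \end{pmatrix},
\qquad (8.4)
$$

where $x_{lb}$ is of length $n_{lb}$, $x_I$ is a vector of length $n_I$ of interface variables, and
the right-hand side vectors $b_{lb}$ are of length $m_{lb}$, such that

$$
\begin{pmatrix} A_{lb,lb} & C_{lb} \end{pmatrix}
\begin{pmatrix} x_{lb} \\ x_I \end{pmatrix} = b_{lb}, \quad 1 \le lb \le Nb.
$$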

A partial factorization of each block matrix is performed, that is,

$$
\begin{pmatrix} A_{lb,lb} & C_{lb} \end{pmatrix}
= P_{lb}
\begin{pmatrix} L_{lb} & \\ \bar{L}_{lb} & I \end{pmatrix}
\begin{pmatrix} U_{lb} & \bar{U}_{lb} \\ & S_{lb} \end{pmatrix}
Q_{lb},
\qquad (8.5)
$$

where $P_{lb}$ and $Q_{lb}$ are permutation matrices, $L_{lb}$ and $U_{lb}$ are $n_{lb} \times n_{lb}$ lower
and upper triangular matrices, respectively, and if $q_{lb}$ is the number of columns in $C_{lb}$
with at least one entry, $S_{lb}$ is a $(m_{lb} - n_{lb}) \times q_{lb}$ local Schur complement matrix.
Pivots can only be chosen from the columns of $A_{lb,lb}$ because the columns of $C_{lb}$
have entries in at least one other border block $C_{jb}$ ($jb \ne lb$). The pivot candidate
$(A_{lb,lb})_{11}$ at the first elimination step must satisfy

$$
\max_{i>1} |(A_{lb,lb})_{i1}| \le \gamma^{-1} |(A_{lb,lb})_{11}|,
$$
and provided A is nonsingular, there will always be a numerically satisfactory
pivot in column 1 of $A_{lb,lb}$. The same is true at each elimination step so that $n_{lb}$
pivots can be chosen. An $n_I \times n_I$ matrix S is obtained by assembling the Nb local

ALGORITHM 8.8 Coarse-grained parallel LU factorization and solve using SBBD form
Input: Linear system in SBBD form (8.4).
Output: Block LU factorization and computed solution x.

1: S = 0 and $z_I = 0$
2: for lb = 1 : Nb do
3:     Perform a partial LU factorization (8.5) of $(A_{lb,lb} \;\; C_{lb})$.
4:     Solve $P_{lb} \begin{pmatrix} L_{lb} & \\ \bar{L}_{lb} & I \end{pmatrix} \begin{pmatrix} y_{lb} \\ \bar{y}_{lb} \end{pmatrix} = b_{lb}$
5:     $S = S + S_{lb}$ and $z_I = z_I + \bar{y}_{lb}$    ▷ Assemble S and $z_I$
6: end for
7: $S = P_S L_S U_S Q_S$    ▷ $P_S$ and $Q_S$ are permutation matrices
8: Solve $P_S L_S y_I = z_I$ and then $U_S Q_S x_I = y_I$    ▷ Forward then back substitution
9: for lb = 1 : Nb do
10:    Solve $U_{lb} Q_{lb} x_{lb} = y_{lb} - \bar{U}_{lb} Q_{lb} x_I$
11: end for

Schur complement matrices $S_{lb}$. The approach is summarized as Algorithm 8.8. The
operations on the submatrices can be performed in parallel.


8.5.3 Ordering to Singly Bordered Form 


The objective is to permute A to an SBBD form with a narrow column border. One
way to do this is to choose the number Nb > 1 of required blocks and use nested
dissection to compute a vertex separator $\mathcal{V}_S$ of $\mathcal{G}(A + A^T)$ such that removing
$\mathcal{V}_S$ and its incident edges splits $\mathcal{G}(A + A^T)$ into Nb components. Then initialize the
set $\mathcal{S}_C$ of border columns to $\mathcal{V}_S$ and let $\mathcal{V}_{1b}, \mathcal{V}_{2b}, \ldots, \mathcal{V}_{Nb}$ be the
subsets of column indices of A that correspond to the Nb components and let $n_{i,kb}$ be the
number of column indices in row i that belong to $\mathcal{V}_{kb}$. If
$lb = \arg\max_{1 \le kb \le Nb} n_{i,kb}$, then row i is assigned to partition lb. All column
indices in row i that do not belong to $\mathcal{V}_{lb}$ are moved into $\mathcal{S}_C$. Once all the rows
have been considered, the only rows that remain unassigned are those that have all their
nonzero entries in $\mathcal{V}_S$. Such rows can be assigned equally to the Nb partitions. If
$j \in \mathcal{S}_C$ is such that column j of A has nonzero entries only in rows belonging to
partition kb, then j can be removed from $\mathcal{S}_C$ and added to $\mathcal{V}_{kb}$. The procedure is
outlined as Algorithm 8.9. The computed vector block and set $\mathcal{S}_C$ can be used to define
permutation matrices P and Q such that $PAQ = A_{SB}$. In practice, it may be necessary to
modify the algorithm to ensure a good row balance between the number of rows in the blocks;
this may lead

ALGORITHM 8.9 SBBD ordering of a general matrix

Input: Matrix A, the number Nb > 1 of blocks and corresponding vertex separator
$\mathcal{V}_S$ of $\mathcal{G}(A + A^T)$.

Output: Vector block such that block(i) denotes the partition in the SBBD form to
which row i is assigned ($1 \le i \le n$) and $\mathcal{S}_C$ is the set of border columns.

1: Initialise $\mathcal{S}_C = \mathcal{V}_S$ and block(1 : n) = 0
2: Initialise $\mathcal{V}_{kb}$ to hold the column indices of A that correspond to component kb
   of $\mathcal{G}(A + A^T)$ after the removal of $\mathcal{V}_S$, $1 \le kb \le Nb$
3: for each row i do
4:     Add up the number $n_{i,kb}$ of column indices belonging to $\mathcal{V}_{kb}$, $1 \le kb \le Nb$
5:     Find $lb = \arg\max_{1 \le kb \le Nb} n_{i,kb}$
6:     block(i) = lb
7:     for each column index j in row i do
8:         if $j \in \mathcal{V}_{kb}$ and $kb \ne lb$ then
9:             Remove j from $\mathcal{V}_{kb}$ and add to $\mathcal{S}_C$
10:        end if
11:    end for
12: end for
13: Assign the rows i for which block(i) = 0 equally between the Nb partitions.
14: If some column $j \in \mathcal{S}_C$ has nonzero entries only in rows belonging to partition
    kb then remove j from $\mathcal{S}_C$ and add to $\mathcal{V}_{kb}$


to a larger $\mathcal{S}_C$. It is also necessary to avoid adding duplicate column indices into
$\mathcal{S}_C$ (alternatively, a final step can be added that removes duplicates).
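
The steps of Algorithm 8.9 can be sketched as follows. This is an illustrative sketch under
stated assumptions rather than the code of any library: rows and columns are 0-based, the
sparsity pattern is given as a list of column index lists (one per row), partitions are
numbered 1 to Nb with 0 meaning unassigned, a row whose columns all lie in the border is left
for Step 13, and ties in Step 5 go to the lowest partition number. The function name
`sbbd_ordering` is hypothetical. Holding the border columns in a Python set avoids duplicates
automatically.

```python
def sbbd_ordering(rows, components, separator):
    """Return (block, border, owner) for the sparsity pattern given by `rows`."""
    nb = len(components)
    # owner[j] = kb means that column j currently belongs to V_kb.
    owner = {j: kb for kb, comp in enumerate(components, start=1) for j in comp}
    border = set(separator)                      # S_C, initialised to V_S
    block = [0] * len(rows)
    for i, cols in enumerate(rows):
        counts = [0] * (nb + 1)                  # counts[kb] = n_{i,kb}
        for j in cols:
            if j in owner:
                counts[owner[j]] += 1
        if max(counts) == 0:
            continue                             # all entries lie in border columns
        lb = counts.index(max(counts))
        block[i] = lb
        for j in cols:
            if j in owner and owner[j] != lb:
                del owner[j]                     # remove j from V_kb ...
                border.add(j)                    # ... and add it to S_C
    # Step 13: share out the unassigned rows between the partitions.
    unassigned = [i for i, b in enumerate(block) if b == 0]
    for t, i in enumerate(unassigned):
        block[i] = 1 + t % nb
    # Step 14: a border column used by a single partition can leave the border.
    for j in sorted(border):
        parts = {block[i] for i, cols in enumerate(rows) if j in cols}
        if len(parts) == 1:
            border.discard(j)
            owner[j] = parts.pop()
    return block, border, owner


# 4 x 4 pattern; removing column/vertex 2 splits G(A + A^T) into {0, 1} and {3}.
rows = [[0, 1], [0, 1, 2], [2, 3], [2, 3]]
block, border, owner = sbbd_ordering(rows, [[0, 1], [3]], [2])
```

In this small example rows 0 and 1 go to the first block, rows 2 and 3 to the second, and
column 2 is the only border column, so $n_I = (2 - 2) + (2 - 1) = 1$.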

The matching-based orderings discussed in Section 6.3 that permute off-diagonal
entries onto the diagonal can increase the symmetry index of the resulting reordered
matrix, particularly in cases where A is very sparse with a large number of zeros
on the diagonal. Frequently, applying a matching ordering before ordering to SBBD
form reduces the number of columns in $\mathcal{S}_C$.


8.6 Notes and References 


The most influential early paper on orderings for sparse symmetric matrices is 
that of Tinney & Walker (1967). It first proposed the minimum degree algorithm 
(referred to as scheme 2) and the minimum fill-in algorithm (referred to as scheme 
3). The fast implementation of the minimum degree algorithm using quotient 
graphs is summarized by George & Liu (1980a). Further developments were 
made throughout the 1980s, including the multiple minimum degree variant, mass
elimination and external degree; key references are Liu (1985) and George & Liu
(1989). An important development in the 1990s was the approximate minimum
degree algorithm of Amestoy et al. (1996). Modifying the AMD algorithm for
matrices with some dense rows is discussed in Dollar & Scott (2010). For a
careful description of different variants of the minimum degree strategy and their
complexity we recommend Heggernes et al. (2001). Rothberg & Eisenstat (1998)
consider both minimum degree and minimum fill strategies and Erisman et al.
(1987) provide an early evaluation of different strategies for nonsymmetric matrices.

Jennings (1966) presents the first envelope method for sparse Cholesky factor- 
izations. The Cuthill-McKee algorithm comes from the paper by Cuthill & McKee 
(1969). The GPS algorithm was originally introduced in Gibbs et al. (1976). The 
book by George & Liu (1981) gives a detailed description of the algorithm while 
Meurant (1999) includes an enlightening discussion of the relation between the CM 
and RCM algorithms. A quick search of the literature shows that a large number of 
bandwidth and profile reduction algorithms have been (and continue to be) reported. 
Many have their origins in the Cuthill-McKee and GPS algorithms. A widely used 
two-stage variant that employs level sets is the so-called Sloan algorithm (Sloan, 
1986); see also Reid & Scott (1999) for details of an efficient implementation. The 
use of the Fiedler vector to obtain spectral orderings is introduced in Barnard et al. 
(1995), with analysis given in George & Pothen (1997). A hybrid algorithm that 
combines the spectral method with the second stage of Sloan’s algorithm to further 
reduce the profile is proposed in Kumfert & Pothen (1997) and a multilevel variant 
is given by Hu & Scott (2001). de Oliveira et al. (2018) provide a recent comparison 
of many bandwidth and profile reduction algorithms. 

Reducing the bandwidth when A is nonsymmetric is discussed by Reid &
Scott (2006). For highly nonsymmetric A, Scott (1999) applies a modified Sloan
algorithm to the row graph (that is, $\mathcal{G}(AA^T)$) to derive an effective ordering
of the rows of A for use with a frontal solver. The approach originally proposed
by Markowitz (1957) for finding pivots during an LU factorization is incorporated 
(in modified form) in a number of serial LU factorization codes, including the 
early solvers MA28 and Y12M (Duff, 1980 and Zlatev, 1991, respectively) as well 
as MA48 (Duff & Reid, 1996). The book of Duff et al. (2017) includes detailed 
discussions. To limit permutations to being symmetric, Amestoy et al. (2007) 
propose minimizing the Markowitz count among the diagonal entries. 

A seminal paper on global orderings is George (1973), but a real revolution in 
the field followed the theoretical analysis of the application of nested dissection for 
general symmetrically structured sparse matrices given in Lipton et al. (1979). For 
subsequent extensions discussing separator sizes we suggest Agrawal et al. (1993), 
Teng (1997), and Spielman & Teng (2007). 

From the early 1990s onwards, there have been numerous contributions to graph 
partitioning algorithms. Significant developments, including multilevel algorithms, 
have been driven in part by the design and development of mathematical software, 
notably the well-established packages METIS (2022) and Scotch (2022); both 
offer versions for sequential and parallel graph partitioning (see also the papers 
by Karypis & Kumar, 1998a,b and Chevalier & Pellegrini, 2008). The book by
Bichot & Siarry (2013) discusses a number of contributions, including hypergraph
partitioning, which is well suited to parallel computational models (see, for example,
Ucar & Aykanat, 2007 and references to the use of hypergraphs given in the survey
article of Davis et al., 2016; they can also be used for profile reduction, see Acer et al.,
2019).

Hu et al. (2000) present a serial algorithm for ordering nonsymmetric A to SBBD 
form; an implementation is available as HSL_MC66 within the HSL mathematical 
software library. Algorithm 8.9 is from Hu & Scott (2005) (see also Duff & Scott, 
2005). Alternatively, hypergraphs can be used for SBBD orderings. The best-known 
packages are the serial code PaToH of Aykanat et al. (2004) and the parallel code 
PHG from Zoltan (2022).


Chapter 9
Algebraic Preconditioners and Approximate Factorizations


In conjunction with iterative methods, preconditioning is often 
the vital component in enabling the solution of such (linear) 
systems when the dimension is large. — Wathen (2015) 


Preconditioning involves exploiting ideas from sparse direct 
solvers. Gradually, iterative methods started to approach the 
quality of direct solvers. In earlier times, iterative methods were 
often special purpose in nature... Now iterative methods are 
almost mandatory. — Saad (1996b). 


When a matrix factorization is performed using finite precision arithmetic, the 
computed factors are not the exact factors. Despite this, the objective of sparse direct 
methods is normally to compute solutions that are accurate within the precision 
used. As discussed in Chapter 7, theoretical results can be used to assess both 
stability and accuracy. 

The effort to obtain results that are as accurate as possible can lead to complex 
coding and unavoidable inefficiencies that can be magnified by modern computer 
architectures. Furthermore, in some situations, more accuracy than is needed (or 
is justified by the input data) is sought by a direct method. These issues can 
potentially be addressed by intentionally relaxing the required accuracy of the 
computed factors. In Section 7.3.3, we discussed static pivoting that allows pivots to 
be explicitly perturbed during a matrix factorization to enable them to be selected, 
thereby reducing the computational costs of the factorization (in terms of time and 
memory). The penalty is that the factorization may be less stable and a refinement 
process (such as described in Algorithm 7.3) may be needed to improve the 
accuracy of the computed solution. However, even with sophisticated theoretical 
and algorithmic tools, factorizations that use such strategies can still be prohibitively 
expensive and may not be fully robust. An alternative approach is to compute a
simpler, cheaper and sparser approximate factorization of A (or of $A^{-1}$) and
to use this as a preconditioner in combination with an iterative solver to derive a
suitable solution of the linear system. The main obstacle is that the choice of an
efficient preconditioner is highly problem dependent: what works well for problems
from one application may not help for those of a different origin. Our focus is on
algebraic preconditioners that are often successfully used in the solution of linear 
systems arising from a range of diverse applications. 

Algebraic preconditioners do not require knowledge of the provenance of the 
linear system, and their construction relies solely on the matrix A (which may 
only be available implicitly, that is, the action of A on vectors is known, but A 
itself is not supplied). They are general methods that are particularly important 
when little is known about the underlying problem and they are widely applicable 
because they are designed with few restrictions. However, if more information is 
known, it can be more effective to use a specialized preconditioner that is designed 
for the specific application. This division between approaches to preconditioning 
essentially amounts to whether we are “given a problem” or “given a matrix”: 
algebraic preconditioning is primarily concerned with the latter. 

In the following, we refer to an approximate factorization as an incomplete 
factorization to distinguish it from a complete factorization of a direct method. 


9.1 Introduction to Iterative Solvers 


The two main classes of iterative methods for solving Ax = b are stationary 
iterative methods (also sometimes called relaxation or simple methods) and Krylov 
subspace methods. We briefly introduce each class. 


9.1.1 Stationary Iterative Methods 


Stationary iterative methods work by splitting A as follows:

$$
A = M - N,
$$

where the matrix M is chosen to be nonsingular and easy to invert. Starting with an
initial guess $x^{(0)}$, the iterations are then given by

$$
x^{(k+1)} = M^{-1} N x^{(k)} + M^{-1} b, \quad k = 0, 1, \ldots \qquad (9.1)
$$

This can be rewritten as

$$
x^{(k+1)} = x^{(k)} + M^{-1} (b - A x^{(k)}) = x^{(k)} + M^{-1} r^{(k)}, \quad k = 0, 1, \ldots, \qquad (9.2)
$$

where the vector $r^{(k)} = b - A x^{(k)}$ is the residual on the k-th iteration. Observe that
by substituting $b = r^{(k)} + A x^{(k)}$ into $x = A^{-1} b$, we obtain

$$
x = A^{-1} (r^{(k)} + A x^{(k)}) = x^{(k)} + A^{-1} r^{(k)},
$$

and if M is used to approximate A, we again get the iteration (9.2). From (9.2),

$$
r^{(k+1)} = b - A (x^{(k)} + M^{-1} r^{(k)}) = (I - A M^{-1}) r^{(k)} = \ldots = (I - A M^{-1})^{k+1} r^{(0)},
\qquad (9.3)
$$

and if $e^{(k)} = x - x^{(k)}$ is the error vector on iteration k, then

$$
e^{(k+1)} = M^{-1} N e^{(k)} = \ldots = (M^{-1} N)^{k+1} e^{(0)} = (I - M^{-1} A)^{k+1} e^{(0)}. \qquad (9.4)
$$

The matrix $I - M^{-1} A$ or $I - A M^{-1}$ is called the iteration matrix. In general,
(9.3) is evaluated rather than (9.4) because $e^{(k)}$ is unknown and (9.3) computes the
residuals that are often used to monitor convergence.


Theorem 9.1 (Saad 2003b; Greenbaum 1997) For any initial $x^{(0)}$ and vector b,
the iteration (9.1) converges if and only if the spectral radius of the iteration matrix
$(I - M^{-1} A)$ is less than unity.


Proof The spectral radius of an n × n matrix C with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ is
defined to be

$$
\rho(C) = \max\{ |\lambda_i| \mid 1 \le i \le n \}. \qquad (9.5)
$$

Furthermore, the sequence of matrix powers $C^k$, k = 0, 1, ..., converges to zero if
and only if $\rho(C) < 1$. It follows from (9.4) that if the spectral radius of $(I - M^{-1} A)$
is less than unity, then the iteration (9.1) converges for any $x^{(0)}$ and b. Conversely,
the relation

$$
x^{(k+1)} - x^{(k)} = (I - M^{-1} A)(x^{(k)} - x^{(k-1)}) = \ldots = (I - M^{-1} A)^{k} M^{-1} (b - A x^{(0)})
$$

shows that if the iteration converges for any $x^{(0)}$ and b, then $(I - M^{-1} A)^k v$
converges to zero for any v. Consequently, $\rho(I - M^{-1} A)$ must be less than unity,
and the result follows. □


It is generally impractical to compute the spectral radius, and so sufficient conditions
that guarantee convergence are used. Because $\rho(C) \le \|C\|$ for any matrix norm, a
sufficient condition is $\|I - M^{-1} A\| < 1$. A small spectral radius leads to rapid
convergence, and the closer the eigenvalues of $M^{-1} A$ are to unity, the faster the
convergence. However, the eigenvalue distribution (not just the spectral radius) is
important in evaluating the rate of convergence.

Several standard stationary methods are obtained from the splitting

$$
A = D_A + L_A + U_A, \qquad (9.6)
$$

where $D_A$ is a diagonal matrix that represents the diagonal part of A, and $L_A$ and
$U_A$ are the strictly lower and upper triangular parts of A, respectively. If $\omega > 0$ is a
scalar parameter, classical methods include:

• Richardson method: $M = \omega^{-1} I$
• Jacobi and damped Jacobi methods: $M = D_A$ and $M = \omega^{-1} D_A$
• Gauss-Seidel and SOR methods: $M = D_A + L_A$ and $M = \omega^{-1} D_A + L_A$
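
The iteration (9.2) with these choices of M can be sketched in NumPy as follows. The sketch
is illustrative only: it uses dense matrices, applies $M^{-1}$ through `numpy.linalg.solve`
rather than exploiting the diagonal or triangular structure of M, and the function names
(`splitting_matrix`, `stationary_solve`, `iteration_spectral_radius`) are hypothetical.

```python
import numpy as np


def splitting_matrix(A, method, omega=1.0):
    """Return M for the Richardson, (damped) Jacobi or SOR/Gauss-Seidel splitting."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)                       # strictly lower triangular part
    if method == "richardson":
        return np.eye(len(A)) / omega
    if method == "jacobi":
        return D / omega                       # omega = 1 gives plain Jacobi
    if method == "sor":
        return D / omega + L                   # omega = 1 gives Gauss-Seidel
    raise ValueError(f"unknown method: {method}")


def stationary_solve(A, b, M, tol=1e-10, max_iter=500):
    """Iterate x <- x + M^{-1} r until the relative residual is below tol."""
    x = np.zeros_like(b, dtype=float)
    for k in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        x = x + np.linalg.solve(M, r)
    return x, max_iter


def iteration_spectral_radius(A, M):
    """Spectral radius of the iteration matrix I - M^{-1} A."""
    G = np.eye(len(A)) - np.linalg.solve(M, A)
    return max(abs(np.linalg.eigvals(G)))


A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
M_jacobi = splitting_matrix(A, "jacobi")
rho = iteration_spectral_radius(A, M_jacobi)
x, iterations = stationary_solve(A, b, M_jacobi)
```

For this diagonally dominant example the Jacobi iteration matrix has spectral radius below
one, so Theorem 9.1 guarantees convergence; calling `splitting_matrix(A, "sor")` instead gives
the Gauss-Seidel splitting.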


9.1.2 Krylov Subspace Methods 


Non-stationary iterative methods are of the form

$$
x^{(k+1)} = x^{(k)} + \omega^{(k)} M^{-1} r^{(k)}, \quad k = 0, 1, \ldots,
$$

where the $\omega^{(k)}$ are scalars. In this class of methods, Krylov subspace methods are
the most effective. Given a vector y, the k-th Krylov subspace $\mathcal{K}^{(k)}(A, y)$ generated
by A from the vector y is defined to be

$$
\mathcal{K}^{(k)}(A, y) = \mathrm{span}(y, Ay, \ldots, A^{k-1} y).
$$

The idea behind Krylov subspace methods is to generate a sequence of approximate
solutions $x^{(k)} \in x^{(0)} + \mathcal{K}^{(k)}(A, r^{(0)})$ such that the norm of the corresponding
residuals $r^{(k)} \in \mathcal{K}^{(k+1)}(A, r^{(0)})$ converges to zero. For symmetric positive definite
(SPD) systems, the Krylov subspace method of choice is the conjugate gradient
(CG) method. For nonsymmetric systems, there are a number of popular methods,
including the generalized minimal residual (GMRES) method and the biconjugate
gradient (BiCG) method, but there is no single method of choice. The key feature
they have in common is that at each iteration only matrix-vector products with A
(and possibly with $A^T$ in the nonsymmetric case) are required.

Krylov subspace methods are powerful and nowadays, when combined with a 
preconditioner, comprise the most widely used class of preconditioned iterative 
methods. Because they build a basis, in exact arithmetic, convergence is achieved in 
at most n iterations (but in the presence of rounding errors, this is not guaranteed). 
If n is large, it is impractical to perform O(n) iterations; the hope is that the process 
returns a sufficiently accurate solution far earlier. Unfortunately, for a given A,
right-hand side vector b, and initial guess $x^{(0)}$, it is usually not possible to predict
the rate of convergence. If A is an SPD matrix, then it can be shown that the approximate
solution $x^{(k)}$ at iteration k computed using the CG method satisfies

$$
\| x - x^{(k)} \|_A \le 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^{k} \| x - x^{(0)} \|_A,
$$

where $\| \cdot \|_A$ is the A-norm, and $\kappa(A)$ is the spectral condition number given
by (7.15). Clearly, there is good (fast) convergence when $\kappa(A)$ is small, but poor
(slow) convergence usually occurs if $\kappa(A) \gg 1$. But this error bound can be highly
pessimistic. It does not show the potential for CG to converge superlinearly or
that the rate of convergence depends on the distribution of all the eigenvalues of
A. In practice, it is not normally possible to obtain detailed spectral information.
Thus, even for CG, preconditioning is often based on experimentation. For non- 
SPD matrices, less is known and methods that guarantee the monotonic reduction 
of a relevant quantity at each iteration are sometimes favoured. For example, if 
the minimal residual (MINRES) method is used for solving symmetric indefinite 
systems, then in exact arithmetic, the norm of the residual is monotonically 
decreasing. However, no general descriptive convergence theory is available for 
Krylov subspace methods for nonsymmetric systems (including GMRES). This is 
a significant problem because, without theory to guide us, preconditioning must be 
heuristic. 
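
The CG error bound above can be checked numerically on a model problem. The sketch below assumes an unpreconditioned textbook CG iteration started from $x^{(0)} = 0$ and uses the one-dimensional Laplacian tridiag(−1, 2, −1), whose eigenvalues $2 - 2\cos(j\pi/(n+1))$ are known in closed form, so that $\kappa(A)$ is available without an eigensolver. The matrix, the chosen solution vector, and all function names are illustrative assumptions.

```python
import math

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def a_norm(A, e):
    """A-norm of e, that is sqrt(e^T A e)."""
    return math.sqrt(dot(e, matvec(A, e)))

def cg_error_history(A, b, x_exact, num_iter):
    """Run CG from x^(0) = 0 and record ||x - x^(k)||_A for k = 0..num_iter."""
    x = [0.0] * len(b)
    r = list(b)                      # r^(0) = b - A x^(0) = b
    p = list(r)
    rr = dot(r, r)
    errors = [a_norm(A, [xe - xi for xe, xi in zip(x_exact, x)])]
    for _ in range(num_iter):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        errors.append(a_norm(A, [xe - xi for xe, xi in zip(x_exact, x)]))
    return errors

n = 20
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
x_exact = [float(i + 1) for i in range(n)]   # illustrative known solution
b = matvec(A, x_exact)

# Extremal eigenvalues of tridiag(-1, 2, -1) and the resulting condition number.
lam_min = 2.0 - 2.0 * math.cos(math.pi / (n + 1))
lam_max = 2.0 - 2.0 * math.cos(n * math.pi / (n + 1))
kappa = lam_max / lam_min
rate = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)

errors = cg_error_history(A, b, x_exact, 10)
bounds = [2.0 * rate ** k * errors[0] for k in range(len(errors))]
```

Comparing `errors` with `bounds` entry by entry shows how much slack the bound leaves on a particular problem.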


9.2 Introduction to Algebraic Preconditioners 


Preconditioning corresponds to the application of a matrix (or a linear operator) to 
the original linear system to yield a different linear system that has more favourable 
properties. Consider the preconditioned linear system 


$$M^{-1}Ax = M^{-1}b. \tag{9.7}$$


Here $M^{-1}$ is applied to A from the left. We say that A is preconditioned from the left
and M is a left preconditioner. Analogously, the linear system can be preconditioned
from the right

$$AM^{-1}y = b, \qquad x = M^{-1}y. \tag{9.8}$$


The following result states that it is not possible to determine a priori which variant 
is the best. 


Theorem 9.2 (Mendelsohn 1956) Let $\delta$ and $\Delta$ be positive numbers. Then, for any
$n \ge 3$, there exist nonsingular n × n matrices A and M such that all the entries of
$M^{-1}A - I$ have absolute value less than $\delta$ and all the entries of $AM^{-1} - I$ have
absolute values greater than $\Delta$.


Nevertheless, the choice between left and right preconditioning is still important 
and may be based on the properties of the coupling of the preconditioner with 
the iterative method or on the distribution of the eigenvalues of A. The computed 
quantities that are readily available during a preconditioned iterative method depend 
on how the preconditioner is applied and this may influence the choice. These 
quantities may be used, for example, to decide when to terminate the iterations. An 
obvious advantage of right preconditioning is that in exact arithmetic, the residuals 
for the right preconditioned system are identical to the true residuals, enabling 
convergence to be monitored accurately. In some cases, the numerical properties 
of an implementation and/or the computer architecture may also play a part. 
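
The remark about residuals can be made explicit with a short derivation. Writing $r^{(k)} = b - Ax^{(k)}$ for the true residual and, in the right preconditioned case, $x^{(k)} = M^{-1}y^{(k)}$ (the iteration superscripts are notation assumed here for illustration), the residuals of the systems (9.7) and (9.8) are:

```latex
\begin{aligned}
\text{left:}\quad  & M^{-1}b - M^{-1}Ax^{(k)} = M^{-1}\bigl(b - Ax^{(k)}\bigr) = M^{-1} r^{(k)}, \\
\text{right:}\quad & b - AM^{-1}y^{(k)} = b - Ax^{(k)} = r^{(k)}.
\end{aligned}
```

With left preconditioning, the quantity readily available to the solver is therefore $M^{-1}r^{(k)}$, whose norm can differ substantially from that of $r^{(k)}$.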

For M in factorized form $M = M_1 M_2$, two-sided (or split) preconditioning is
an option. The iterative method then solves the transformed system

$$M_1^{-1} A M_2^{-1} y = M_1^{-1} b, \qquad x = M_2^{-1} y. \tag{9.9}$$

If A and M are SPD matrices, then $M_2 = M_1^T$ and we would like the preconditioned
matrix $M_1^{-1} A M_1^{-T}$ to be SPD. However, it is not necessary to use a two-sided
transformation with the preconditioned conjugate gradient (PCG) method because
it can be formulated using the M-inner product in which the matrix $M^{-1}A$ is
self-adjoint.


Theorem 9.3 (Saad 2003b; van der Vorst 2003) Let A and M be SPD matrices.
Then $M^{-1}A$ is self-adjoint in the M-inner product.


Proof Self-adjointness is implied by the following chain of equalities.

$$(M^{-1}Ax, y)_M = (Ax, y) = (x, Ay) = (x, M M^{-1} A y) = (Mx, M^{-1}Ay) = (x, M^{-1}Ay)_M.$$
□


Left preconditioned CG with the M-inner product is mathematically equivalent
to right preconditioned CG with the $M^{-1}$-inner product. If A is symmetric but not
positive definite, the PCG method can formally be written down, but the necessary 
conditions for convergence may not be satisfied and the method may break down 
(division by a zero quantity). 
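
Theorem 9.3 is easy to check numerically. The sketch below assumes the M-inner product $(u, v)_M = u^T M v$ and, purely for illustration, takes M to be the diagonal of a small SPD matrix; the matrix, the vectors, and the function names are hypothetical choices.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

A = [[4.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]]   # SPD
m = [4.0, 3.0, 2.0]                                            # M = diag(m), SPD

def precond_matvec(x):
    """Return M^{-1} A x for the diagonal M above."""
    return [v / mi for v, mi in zip(matvec(A, x), m)]

def m_inner(u, v):
    """M-inner product (u, v)_M = u^T M v."""
    return sum(mi * ui * vi for mi, ui, vi in zip(m, u, v))

x = [1.0, -2.0, 0.5]
y = [0.3, 1.0, -1.5]

# Self-adjoint in the M-inner product ...
lhs_m = m_inner(precond_matvec(x), y)
rhs_m = m_inner(x, precond_matvec(y))
# ... but M^{-1} A is not symmetric, so the Euclidean inner products differ.
lhs_e = dot(precond_matvec(x), y)
rhs_e = dot(x, precond_matvec(y))
```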


9.2.1 Desirable Preconditioner Properties 


An obvious objective is for the preconditioner to lead to rapid convergence. As 
already noted, if the matrix A is SPD, then the convergence rate of the CG method 
depends on the distribution of its eigenvalues. The preconditioner should aim to 
reduce the condition number, but this is not necessarily sufficient to give fast con- 
vergence. For general matrices, despite the lack of theoretical guarantees regarding 
convergence, many useful preconditioners have nevertheless been motivated by 
bounding the condition number of the preconditioned matrix. 

Choosing a preconditioner is often based on how costly it is to compute and on 
some indicators that potentially reflect its quality. In particular, the accuracy of a 
preconditioner M can be assessed using the norm of the error matrix 


$$\|E\| = \|M - A\|,$$

and its stability can be measured using

$$\|M^{-1}E\| = \|I - M^{-1}A\| \quad \text{or} \quad \|EM^{-1}\| = \|I - AM^{-1}\|.$$

If a preconditioner is used to solve a large number of systems over which the
cost of constructing it can be amortized, then the expense of constructing M in
terms of time may not be the driving factor. However, as the preconditioner must
be applied at each iteration of the solver, unless very few iterations are performed,
it is essential that each application is inexpensive. Each application $M^{-1}w$ involves
solving a linear system Mv = w. If M is in factorized form and the factors are
(block) triangular, this is straightforward but, because triangular solves are inherently
serial and hard to parallelize, repeated substitutions can be a critical computational bottleneck.
In some cases, rather than M, the inverse $M^{-1}$ is computed directly. In this case,
we have an approximate inverse preconditioner. Applying such a preconditioner
involves only matrix-vector multiplications, which are normally easier to parallelize.
However, because the inverse of an irreducible matrix is dense (Theorem 7.3), it is
important that $M^{-1}$ is constructed to be sparse. Such preconditioners are discussed
in Chapter 11.


9.2.2 Simple Algebraic Preconditioners 


The simplest preconditioner consists of the diagonal of the matrix, $M = D_A$. This
is known as the (point) Jacobi preconditioner. Block versions can be derived by
partitioning $V = \{1, 2, \ldots, n\}$ into mutually disjoint subsets $V_1, \ldots, V_l$ and then
setting

$$m_{ij} = \begin{cases} a_{ij} & \text{if } i \text{ and } j \text{ belong to the same subset } V_k \text{ for some } k,\ 1 \le k \le l, \\ 0 & \text{otherwise.} \end{cases}$$


Often, natural choices for the partitioning suggest themselves. For example, super- 
variables can be used or the partitioning may be chosen to coincide with the division 
of variables over the processors in a parallel environment. Jacobi preconditioners 
need very little storage and are easy to implement. 
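
The block Jacobi definition translates directly into code. The following is a minimal sketch for a dense matrix stored as a list of rows, with 0-based indices; the function name and the example partition are illustrative assumptions.

```python
def block_jacobi(A, subsets):
    """Return M with m_ij = a_ij if i and j lie in the same subset, else 0.

    `subsets` is a list of mutually disjoint index lists covering 0..n-1.
    """
    n = len(A)
    block_of = {}
    for k, subset in enumerate(subsets):
        for i in subset:
            block_of[i] = k
    return [[A[i][j] if block_of[i] == block_of[j] else 0.0 for j in range(n)]
            for i in range(n)]

A = [[4.0, -1.0, -1.0, 0.0],
     [-1.0, 4.0, 0.0, -1.0],
     [-1.0, 0.0, 4.0, -1.0],
     [0.0, -1.0, -1.0, 4.0]]

M_block = block_jacobi(A, [[0, 1], [2, 3]])       # two 2 x 2 diagonal blocks
M_point = block_jacobi(A, [[0], [1], [2], [3]])   # singletons give point Jacobi
```

In practice M is block diagonal, so each block can be factorized and applied independently.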

The SSOR preconditioner, like the Jacobi preconditioner, can be derived from 
A without any work. If A is symmetric, then using the notation (9.6), the SSOR 
preconditioner is defined to be 


$$M = (D_A + L_A) D_A^{-1} (D_A + L_A)^T, \tag{9.10}$$

or, using a parameter $0 < \omega < 2$, as

$$M = \frac{1}{2 - \omega} \left( \frac{1}{\omega} D_A + L_A \right) \left( \frac{1}{\omega} D_A \right)^{-1} \left( \frac{1}{\omega} D_A + L_A \right)^T.$$

The optimal value of $\omega$ will reduce the number of iterations needed for convergence
of the iterative solver, but it is usually prohibitively expensive to compute the
spectral information needed to calculate it. Again, block variants are possible.
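
Because the SSOR preconditioner is a product of triangular and diagonal factors, applying $M^{-1}$ needs only one forward and one backward substitution. The sketch below assumes a symmetric dense matrix stored as a list of rows and the parameterized form $M = \frac{1}{2-\omega}(\frac{1}{\omega}D_A + L_A)(\frac{1}{\omega}D_A)^{-1}(\frac{1}{\omega}D_A + L_A)^T$; the helper `ssor_matrix` forms M explicitly only so that the result can be checked, and all names are illustrative.

```python
def ssor_solve(A, w, omega=1.0):
    """Return z with M z = w for the SSOR preconditioner M of a symmetric A."""
    n = len(A)
    d = [A[i][i] / omega for i in range(n)]          # diagonal of D_A / omega
    # Forward substitution: (D_A/omega + L_A) u = (2 - omega) w.
    u = [0.0] * n
    for i in range(n):
        s = (2.0 - omega) * w[i] - sum(A[i][j] * u[j] for j in range(i))
        u[i] = s / d[i]
    # Diagonal scaling: v = (D_A/omega) u.
    v = [d[i] * u[i] for i in range(n)]
    # Backward substitution: (D_A/omega + L_A)^T z = v.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = v[i] - sum(A[j][i] * z[j] for j in range(i + 1, n))
        z[i] = s / d[i]
    return z

def ssor_matrix(A, omega=1.0):
    """Form M explicitly (for checking only; never needed in practice)."""
    n = len(A)
    low = [[A[i][j] if j < i else (A[i][i] / omega if j == i else 0.0)
            for j in range(n)] for i in range(n)]
    return [[sum(low[i][k] * (omega / A[k][k]) * low[j][k] for k in range(n))
             / (2.0 - omega) for j in range(n)] for i in range(n)]

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
w = [1.0, 2.0, 3.0]
omega = 1.2
z = ssor_solve(A, w, omega)
M = ssor_matrix(A, omega)
Mz = [sum(mij * zj for mij, zj in zip(row, z)) for row in M]
```

Setting `omega = 1.0` recovers (9.10).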


9.2.3 The Eisenstat Trick


Within a preconditioned iterative solver, it is generally cheaper to apply $M^{-1}$
and A separately, rather than explicitly forming and storing the preconditioned
matrix. However, in special cases, it is possible to improve efficiency by combining
the action of the preconditioner with the matrix-vector multiplication. One such
approach is called the Eisenstat trick. Consider the matrix splitting (9.6), and let
M be given by

$$M = (D + L_A)\,[D^{-1}(D + U_A)] = M_1 M_2, \tag{9.11}$$

where D is a nonsingular diagonal matrix. The SSOR matrix (9.10) is one
example in the symmetric case but more generally $D \ne D_A$. Using two-sided
preconditioning, (9.9) becomes

$$A' y = M_1^{-1} A M_2^{-1} y = (D + L_A)^{-1} A\,[D^{-1}(D + U_A)]^{-1} y = (D + L_A)^{-1} b. \tag{9.12}$$

Setting

$$\bar{L} = D^{-1} L_A, \quad \bar{U} = D^{-1} U_A, \quad \bar{A} = D^{-1} A, \quad \text{and} \quad \bar{b} = (I + \bar{L})^{-1} D^{-1} b,$$

and using (9.6), we obtain

$$\begin{aligned}
A' &= (D + L_A)^{-1} A\,[D^{-1}(D + U_A)]^{-1} = [(D + L_A)^{-1} D]\, D^{-1} A\,[D^{-1}(D + U_A)]^{-1} \\
   &= [D^{-1}(D + L_A)]^{-1} D^{-1} A\,(I + D^{-1} U_A)^{-1} = (I + \bar{L})^{-1} \bar{A}\,(I + \bar{U})^{-1}.
\end{aligned}$$

That is, the system in (9.12) becomes

$$A' y = (I + \bar{L})^{-1} \bar{A}\,(I + \bar{U})^{-1} y = (I + \bar{L})^{-1} D^{-1} b = \bar{b}. \tag{9.13}$$

If y solves (9.13), then the solution x of $(I + \bar{U})x = y$ solves Ax = b. But the
expression for $A'$ can be further transformed as

$$\begin{aligned}
A' &= (I + \bar{L})^{-1} (I + \bar{L} + D^{-1}D_A - 2I + I + \bar{U}) (I + \bar{U})^{-1} \\
   &= (I + \bar{L})^{-1} [(I + \bar{L})(I + \bar{U})^{-1} + (D^{-1}D_A - 2I)(I + \bar{U})^{-1} + I] \\
   &= (I + \bar{U})^{-1} + (I + \bar{L})^{-1} [(D^{-1}D_A - 2I)(I + \bar{U})^{-1} + I].
\end{aligned}$$

Thus, to compute $z = A'w = (I + \bar{L})^{-1} \bar{A} (I + \bar{U})^{-1} w$ for a given w, it is necessary
only to solve two triangular systems

$$(I + \bar{U}) z_1 = w \quad \text{followed by} \quad (I + \bar{L}) z_2 = (D^{-1}D_A - 2I) z_1 + w$$

and then set $z = z_1 + z_2$. Note that this trick is not a preconditioner: it is simply a
way of applying the preconditioner (9.11).
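
The two triangular solves of the Eisenstat trick can be compared against the direct evaluation of $(I + \bar{L})^{-1} \bar{A} (I + \bar{U})^{-1} w$, which needs an additional matrix-vector product with A. The following is a minimal sketch for a dense matrix stored as a list of rows, with the diagonal matrix D passed as a list `d`; the test matrix (nonsymmetric, with $D \ne D_A$) and the function names are illustrative assumptions.

```python
def solve_unit_upper(A, d, rhs):
    """Solve (I + Ubar) t = rhs, where Ubar = D^{-1} U_A."""
    n = len(A)
    t = [0.0] * n
    for i in reversed(range(n)):
        t[i] = rhs[i] - sum(A[i][j] / d[i] * t[j] for j in range(i + 1, n))
    return t

def solve_unit_lower(A, d, rhs):
    """Solve (I + Lbar) t = rhs, where Lbar = D^{-1} L_A."""
    n = len(A)
    t = [0.0] * n
    for i in range(n):
        t[i] = rhs[i] - sum(A[i][j] / d[i] * t[j] for j in range(i))
    return t

def eisenstat_apply(A, d, w):
    """Compute z = A' w using two triangular solves and no product with A."""
    z1 = solve_unit_upper(A, d, w)
    rhs = [(A[i][i] / d[i] - 2.0) * z1[i] + w[i] for i in range(len(w))]
    z2 = solve_unit_lower(A, d, rhs)
    return [u + v for u, v in zip(z1, z2)]

def direct_apply(A, d, w):
    """Compute z = (I + Lbar)^{-1} (D^{-1} A) (I + Ubar)^{-1} w step by step."""
    t = solve_unit_upper(A, d, w)
    s = [sum(a * tj for a, tj in zip(row, t)) / d[i] for i, row in enumerate(A)]
    return solve_unit_lower(A, d, s)

A = [[4.0, -1.0, 0.5, 0.0],
     [-2.0, 5.0, -1.0, 1.0],
     [0.0, -1.5, 6.0, -2.0],
     [1.0, 0.0, -1.0, 7.0]]
d = [5.0, 6.0, 7.0, 8.0]          # diagonal of D, deliberately different from D_A
w = [1.0, -2.0, 3.0, 0.5]
z_trick = eisenstat_apply(A, d, w)
z_direct = direct_apply(A, d, w)
```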


9.3 Some Special Classes of Matrices 


The development of algebraic preconditioners has historically been closely con- 
nected to their earliest application, which was solving linear systems arising from 
the discretization of partial differential equations. Consider a two-dimensional 
Poisson problem discretized on a given domain by a uniform regular grid using finite 
differences, with zero Dirichlet conditions on the boundary. The resulting matrix for 
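a 3 × 3 rectangular grid using the natural ordering of the vertices is given by

$$A = \begin{pmatrix}
4 & -1 & & -1 & & & & & \\
-1 & 4 & -1 & & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
-1 & & & 4 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{pmatrix}. \tag{9.14}$$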


If the spatial discretization on the domain is characterized by the mesh parameter 
h, then the size of A is inversely proportional to h. Expressing some matrix- 
related quantities asymptotically as functions of h can be useful if the discretized 
domain is bounded. For example, the condition number of the matrix (9.14) 
depends asymptotically on $h^{-2}$. Matrices with similar banded sparsity patterns
with nonzeros on only a small number of subdiagonals arise from simple finite 
difference or finite element discretizations of other partial differential equations. 
They can be considered as particular cases of more general special classes of 
matrices whose properties can be derived using the theoretical background behind 
the discretizations. 

M-matrices form one such class. Let the off-diagonal entries of the nonsingular
matrix A be nonpositive (that is, $a_{ij} \le 0$ for all $i \ne j$). Then A is a (nonsingular)
M-matrix if one of the following holds:


• A + D is nonsingular for any diagonal matrix D with nonnegative entries.
• All the entries of $A^{-1}$ are nonnegative.
• All principal minors of A are positive.

The matrix (9.14) is an example of an M-matrix. A symmetric M-matrix is known
as a Stieltjes matrix, and such a matrix is positive definite.

The class of nonsingular H-matrices includes matrices coming from simple
discretizations of convection–diffusion problems. The comparison matrix C(A) of
A is defined to have entries

$$C(A)_{ij} = \begin{cases} -|a_{ij}|, & i \ne j, \\ |a_{ij}|, & i = j. \end{cases}$$

If C(A) is a nonsingular M-matrix, then A is a nonsingular H-matrix.


We also recall diagonally dominant matrices. A is diagonally dominant by rows
if

$$\sum_{j \ne i} |a_{ij}| \le |a_{ii}|, \qquad 1 \le i \le n. \tag{9.15}$$

A is strictly diagonally dominant by rows if strict inequality holds in (9.15) for
all i. A is (strictly) diagonally dominant by columns if $A^T$ is (strictly) diagonally
dominant by rows. A is said to be irreducibly diagonally dominant if it is
irreducible and (9.15) is satisfied with strict inequality for at least one row i. If 
A is strictly diagonally dominant by rows or columns or is irreducibly diagonally 
dominant, then it is nonsingular and factorizable. The class of diagonally dominant 
matrices is closely connected to that of nonsingular H-matrices. For example, the 
property that there exists a diagonal matrix D with positive entries such that AD is 
strictly diagonally dominant is equivalent to A being a nonsingular H-matrix. 
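
The definitions in this section can be exercised on the model matrix (9.14). The sketch below assembles the five-point finite difference matrix for an m × m grid with the natural ordering, forms the comparison matrix, and tests diagonal dominance by rows as in (9.15); the function names are illustrative assumptions and indices are 0-based.

```python
def poisson2d(m):
    """Five-point Laplacian on an m x m grid, natural (row-by-row) ordering."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.0      # left neighbour
            if c < m - 1:
                A[i][i + 1] = -1.0      # right neighbour
            if r > 0:
                A[i][i - m] = -1.0      # neighbour in the previous grid row
            if r < m - 1:
                A[i][i + m] = -1.0      # neighbour in the next grid row
    return A

def comparison_matrix(A):
    """C(A): |a_ii| on the diagonal, -|a_ij| off the diagonal."""
    return [[abs(a) if i == j else -abs(a) for j, a in enumerate(row)]
            for i, row in enumerate(A)]

def diag_dominant_by_rows(A, strict=False):
    """Check (9.15), optionally with strict inequality in every row."""
    for i, row in enumerate(A):
        off = sum(abs(a) for j, a in enumerate(row) if j != i)
        if off > abs(row[i]) or (strict and off == abs(row[i])):
            return False
    return True

A = poisson2d(3)
is_dd = diag_dominant_by_rows(A)
is_strict_dd = diag_dominant_by_rows(A, strict=True)
same_as_comparison = comparison_matrix(A) == A
```

For this matrix the row of the central grid point attains equality in (9.15), so the matrix is diagonally dominant but not strictly so; because its off-diagonal entries are nonpositive and its diagonal is positive, it coincides with its comparison matrix.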


9.4 Introduction to Incomplete Factorizations 


Preconditioners based on an incomplete factorization of A in which entries are 
dropped during the factorization are widely used in computational science and 
engineering, especially when the underlying physics of a problem is difficult to 
exploit. Besides being used as standalone preconditioners, incomplete factorizations 
are important within more sophisticated methods. For example, they can be used to 
precondition subdomain solves in domain decomposition schemes or as a smoother 
in multigrid methods. Incomplete factorizations fall into three main classes: 


(i) Threshold-based methods in which the locations of permissible fill-in are 
determined in conjunction with the numerical factorization of A; entries of 
the computed factors of absolute value less than a prescribed threshold t > 0 
are dropped. Success relies on determining a suitable t. This is highly problem 
dependent and is influenced by the scaling of A. 

Figure 9.1 Illustration of matrix sparsification. f denotes filled entries in the factors. On the left
is the original matrix A with its filled entries, in the centre is the permuted matrix with its filled 
entries, and on the right is the sparsified permuted matrix after dropping the entries of A in positions 
(1, 3) and (3, 1) (it has no filled entries). 


(ii) Memory-based methods in which the amount of memory available for the 
incomplete factorization is prescribed and only the largest entries in each row 
(or column) are retained. 

(iii) Structure-based methods in which an initial symbolic factorization phase deter- 
mines the location of permissible entries using S{A}. This allows the memory 
requirements to be determined before an incomplete numerical factorization is 
performed. The specified set of positions is called the target sparsity pattern. 
A widely used example allows the incomplete factors to have entries only in 
the positions corresponding to S{A}. 


The basic dropping approaches can be combined and they can be employed in 
conjunction with discarding entries in A before the factorization commences. This 
initial sparsification is appealing because it may be possible to obtain an incomplete 
factorization by computing a complete factorization of the sparsified matrix. 
Sparsification can be performed by value or by position. Figure 9.1 illustrates 
sparsification of A after permuting it reveals a block structure (the permutation can 
be found using, for example, Algorithm 3.7 or 3.8). 


9.4.1 Incomplete Factorization Breakdown 


Dropping entries can lead to breakdown of the incomplete factorization, that is, a 
zero pivot may be encountered during the factorization (or a non-positive pivot in the 
Cholesky case). It is only possible to predict when this will happen in special cases, 
as stated in the following theorem, which is a consequence of the fact that being an 
M-matrix or an H-matrix is preserved in the sequence of the Schur complements 
during the factorization. This result does not hold for general SPD matrices. 


Theorem 9.4 (Meijerink & van der Vorst 1977; Manteuffel 1980; Varga et al. 
1980) Let A be a nonsingular M-matrix or H-matrix. If the target sparsity pattern 
of the incomplete factors contains the positions of the diagonal entries, then the 
incomplete factorization of A does not break down. 

To illustrate the error accumulation in the incomplete factorization of an M-
matrix using dropping, consider the example given in (9.14). Let E be the error 
matrix. E is initialized to zero, and at each stage of the factorization, the dropped 
entries are added into it. After one step of the complete factorization of A, the 
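partially eliminated matrix $A^{(2)}$ is

$$A^{(2)} = \begin{pmatrix}
4 & -1 & & -1 & & & & & \\
 & 3.75 & -1 & -0.25 & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
 & -0.25 & & 3.75 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{pmatrix}.$$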

Suppose the filled entries −0.25 in positions (2, 4) and (4, 2) are dropped. Then the
values of the corresponding diagonal entries in the subsequent elimination matrices 
are larger than they would have been without any dropping. Furthermore, as all 
the off-diagonal nonzero entries are negative, for any target sparsity pattern the 
dropped entries are negative. The M-matrix property applies to all subsequent Schur 
complements, which implies that all the entries added into E are negative and
so the absolute values of the entries in E grow as the factorization proceeds (the 
contributions can never cancel each other out). Thus, although the factorization does 
not break down, the growth in the error is potentially a problem for the accuracy of 
an incomplete factorization of an M-matrix. 
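
The first elimination step and the accumulation of dropped entries in E can be reproduced in a few lines. The sketch below rebuilds the matrix (9.14), performs one step of Gaussian elimination, and moves every filled entry that lies outside the sparsity pattern of A into the error matrix; indices are 0-based, so the positions (2, 4) and (4, 2) of the text appear as (1, 3) and (3, 1). Function and variable names are illustrative assumptions.

```python
def poisson2d(m):
    """Five-point Laplacian on an m x m grid, natural ordering (matrix (9.14) for m = 3)."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.0
            if c < m - 1:
                A[i][i + 1] = -1.0
            if r > 0:
                A[i][i - m] = -1.0
            if r < m - 1:
                A[i][i + m] = -1.0
    return A

def eliminate_first_column(A):
    """Return the partially eliminated matrix after one elimination step."""
    S = [row[:] for row in A]
    for i in range(1, len(S)):
        mult = S[i][0] / S[0][0]
        if mult != 0.0:
            for j in range(len(S)):
                S[i][j] -= mult * S[0][j]
    return S

A = poisson2d(3)
A2 = eliminate_first_column(A)
n = len(A)

# Drop filled entries outside the pattern of A and accumulate them in E.
E = [[0.0] * n for _ in range(n)]
dropped = {}
for i in range(1, n):
    for j in range(1, n):
        if A[i][j] == 0.0 and A2[i][j] != 0.0:
            dropped[(i, j)] = A2[i][j]
            E[i][j] += A2[i][j]
            A2[i][j] = 0.0
```

Every value stored in `dropped` is negative, in line with the observation that the contributions to E cannot cancel.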


9.4.2 Perturbing Entries to Prevent Breakdown 


Modifying the diagonal entries of A is a common approach to avoid breakdown 
in an incomplete factorization. Breakdown is illustrated in Figure 9.2. A simple a 
posteriori remedy is to perturb the diagonal value that has caused breakdown. In
this example, $a_{44}$ is increased so that $d_{44}$ has a (small) positive value. Unfortunately,
practical experience of making simple ad hoc modifications is generally not very 
positive. This is because making a local perturbation when breakdown occurs (or 
is close to occurring) may be too late for the resulting factorization to be good 
enough to be useful as a preconditioner (growth may already have happened in 
some of the factor entries). This applies to standard incomplete factorizations and 
to approximate inverses. 

An alternative and more effective strategy to avoid breakdown is to modify all
the diagonal entries of A a priori and then compute an incomplete factorization
of $A + \alpha I$, where the shift $\alpha > 0$ is a scalar parameter. It is always possible
to find $\alpha$ such that $A + \alpha I$ is nonsingular and diagonally dominant and is thus
an H-matrix. However, being an H-matrix is not a necessary condition for a
matrix to be factorizable and, in practice, much smaller values of $\alpha$ can provide
incomplete factorizations for which $\|E\|$ is small. A simple trial-and-error procedure
for choosing a shift is given in Algorithm 9.1. The initial shift $\alpha^{(0)} = 0$ is reasonable
if A is an SPD matrix or, more generally, has positive diagonal entries. If $\alpha^{(0)} > 0$
and the incomplete factorization of $A + \alpha^{(0)} I$ is successful, then the algorithm can
be modified to reduce $\alpha$ (for example, it could be replaced by $\alpha^{(0)}/2$) and then
restarted. The potential benefit is a smaller $\|E\|$ (and hopefully a higher quality
preconditioner) but at the cost of performing further incomplete factorizations.
Observe that A should be prescaled to try and limit the size of $\alpha$.

Figure 9.2 An example to illustrate breakdown. The matrix A and its square root-free factors are
given together with the incomplete factors L and D that result from dropping the entry $l_{24}$ during
the factorization. $d_{44} = 0$ means the incomplete factorization has broken down.

ALGORITHM 9.1 Trial-and-error global shifted incomplete factorization
Input: Matrix A, incomplete factorization algorithm, initial shift $\alpha^{(0)}$
Output: Shift $\alpha$ and incomplete factors L and U such that $A + \alpha I \approx LU$
1: for k = 0, 1, 2, ... do
2:    $A + \alpha^{(k)} I \approx LU$    ▷ Perform incomplete factorization
3:    If successful, $\alpha = \alpha^{(k)}$ and return
4:    Choose $\alpha^{(k+1)} > \alpha^{(k)}$    ▷ Increase the shift (for example, by doubling it)
5: end for
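
The trial-and-error shifting strategy can be sketched as follows. This is an illustration under stated assumptions, not the book's implementation: the incomplete factorization is taken to be a level-zero incomplete Cholesky factorization IC(0), which keeps entries only in the positions of S{A}; the update rule `max(2 * alpha, alpha_min)` and the parameter `alpha_min` are hypothetical choices made so that the loop can leave the initial shift $\alpha^{(0)} = 0$; and the 4 × 4 SPD test matrix is chosen only because IC(0) breaks down on it without a shift.

```python
import math

def ic0(A, alpha=0.0):
    """IC(0) factorization of A + alpha*I for a symmetric dense A.

    Returns the lower triangular factor L, or None if a non-positive
    pivot is encountered (breakdown).
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = A[j][j] + alpha - sum(L[j][k] ** 2 for k in range(j))
        if pivot <= 0.0:
            return None
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            if A[i][j] != 0.0:               # keep only positions in S{A}
                s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = s / L[j][j]
    return L

def shifted_ic0(A, alpha0=0.0, alpha_min=1e-3, max_tries=60):
    """Increase the shift until IC(0) of A + alpha*I succeeds."""
    alpha = alpha0
    for _ in range(max_tries):
        L = ic0(A, alpha)
        if L is not None:
            return alpha, L
        alpha = max(2.0 * alpha, alpha_min)  # hypothetical update rule
    raise RuntimeError("no successful shift found")

# Illustrative SPD matrix for which unshifted IC(0) breaks down.
A = [[3.0, -2.0, 0.0, 2.0],
     [-2.0, 3.0, -2.0, 0.0],
     [0.0, -2.0, 3.0, -2.0],
     [2.0, 0.0, -2.0, 3.0]]

unshifted = ic0(A)
alpha, L = shifted_ic0(A)
```

Each failed attempt costs one (partial) incomplete factorization, which is why prescaling A and choosing a sensible first positive shift matter in practice.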


9.4.3 Pivoting to Prevent Breakdown 


An alternative approach to avoid small pivots is to follow what is done in sparse 
direct solvers and incorporate partial or threshold pivoting within the incomplete 
factorization algorithm. This potentially makes the factorization significantly more 
expensive and much more complicated to implement efficiently. As with sparse 
direct solvers, preprocessing can limit the need for pivoting. If A is nonsymmetric, 
then row and column permutations can be used to bring large entries onto the 
diagonal before the factorization commences. In particular, the weighted matching 
ordering and scaling discussed in Section 7.4.2 can be used. In the symmetric case,
symmetry is preserved by choosing pivots from the diagonal. Again, the matrix 
should be prescaled, and then at each stage, a straightforward choice is to select as 
the next pivot the diagonal entry of the largest absolute value in the remaining active 
submatrix. If there is no suitable diagonal entry (for example, if the absolute values 
of all the remaining diagonal entries are less than some threshold), then either the 
diagonal can be modified or 2 × 2 pivots that preserve symmetry can be used.

One way to attempt to minimize the norm of the error matrix E is to select 
the pivot candidate to minimize the sum of the absolute values of the dropped 
(discarded) entries. However, this minimum discarded fill ordering is typically too 
expensive to be useful in practice. 


9.5 Factorizations as Preconditioner Components 


Sometimes (incomplete) factorizations are employed as components in the construc- 
tion of more complex preconditioners. Here some possible approaches are briefly 
discussed. 


9.5.1 Polynomial Preconditioning 


Polynomial preconditioning selects a polynomial $\phi$ and applies a Krylov subspace
method to solve either

$$\phi(A)Ax = \phi(A)b$$

(left preconditioning) or

$$A\phi(A)y = b, \qquad x = \phi(A)y$$

(right preconditioning). $\phi$ should be of small degree and chosen to enhance
convergence. Consider the characteristic polynomial $\phi_n(\mu) = \det(\mu I - A)$ of A
(det denotes the determinant). The Cayley–Hamilton theorem states that A satisfies
its own characteristic equation so that

$$\phi_n(A) = \sum_{j=0}^{n} \beta_j A^j = 0,$$

where $\beta_j$ ($0 \le j \le n$) are the coefficients of the characteristic polynomial
($\beta_n = 1$, $\beta_0 = (-1)^n \det(A)$). Provided A is nonsingular,

$$A^{-1} = -\frac{1}{\beta_0} \sum_{j=1}^{n} \beta_j A^{j-1}.$$

A preconditioner can be constructed by taking the first k terms, possibly weighted
by some suitable scalar coefficients, that is,

$$M^{-1} = \sum_{j=0}^{k} \gamma_j A^j.$$


An important question is why such a preconditioner can help in the presence
of the optimality properties of Krylov subspace methods. For example, at iteration
k + 1 of the CG method, $x^{(k+1)}$ satisfies

$$x^{(k+1)} = x^{(0)} + \phi_k(A) r^{(0)}, \qquad k = 0, 1, \ldots,$$

where $\phi_k$ is a monic polynomial of degree k. This polynomial is optimal in the sense
that $x^{(k+1)}$ minimizes

$$\|x - x^{(k+1)}\|_A. \tag{9.16}$$

A preconditioner that is a polynomial in A cannot speed the convergence because the
resulting iteration again forms the new $x^{(k+1)}$ as $x^{(0)}$ plus a polynomial in A times
$r^{(0)}$, and thus the same or a higher degree polynomial is needed to achieve the same
value of (9.16). Consequently, the number of matrix-vector multiplications cannot
decrease. Nevertheless, polynomial preconditioning can be useful for a number of 
reasons. 


• The polynomial can improve the eigenvalue distribution of the preconditioned
matrix and result in a reduction in the number of iterations required for
convergence (even though the overall complexity may increase).

• It requires very little memory and its implementation can be straightforward.

• It can decrease the number of synchronization points in iterative methods as
represented by inner products. This is potentially important for message-passing
parallel architectures.


Even if only a small number of terms are used in the polynomial approximating
$A^{-1}$, a crucial issue is determining the coefficients $\gamma_0, \ldots, \gamma_k$. A straightforward
way of doing this is based on the Neumann series of a matrix C given by $\sum_{j=0}^{+\infty} C^j$,
which is convergent if and only if $\rho(C) < 1$. In this case,

$$(I - C)^{-1} = \sum_{j=0}^{+\infty} C^j. \tag{9.17}$$

Now let M be a nonsingular matrix and $\omega > 0$ a scalar such that the matrix
$C = I - \omega M^{-1}A$ satisfies $\rho(C) < 1$. Using (9.17),

$$A^{-1} = \omega (\omega M^{-1}A)^{-1} M^{-1} = \omega (I - C)^{-1} M^{-1} = \omega \left( \sum_{j=0}^{+\infty} C^j \right) M^{-1}.$$

Truncating the summation gives as a possible preconditioner

$$M_k^{-1} = \omega \left( \sum_{j=0}^{k} C^j \right) M^{-1},$$

where the subscript k indicates the truncation level. Observe that

$$I - M_k^{-1}A = I - \omega \left( \sum_{j=0}^{k} C^j \right) M^{-1}A = I - \left( \sum_{j=0}^{k} C^j \right)(I - C) = C^{k+1},$$

which shows the positive effect of increasing k. If A and M are SPD matrices, then
$M_k$ can be used with the CG method preconditioned from the left because $M_k^{-1}A$
is self-adjoint in the $M_k$-inner product. Generalizations of the approach weight the
powers of C in $M_k^{-1}$ using additional scalars. The choice of M is crucial for the
effectiveness of the approach.
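
The identity relating the truncated Neumann series preconditioner $\omega (\sum_{j=0}^{k} C^j) M^{-1}$ to the power $C^{k+1}$ can be verified numerically. The sketch below assumes, for illustration only, that M is the diagonal of A (passed as the list `m_diag`) and uses small dense matrices stored as lists of rows; all names are hypothetical.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_preconditioner(A, m_diag, omega, k):
    """Return (P, C) with C = I - omega*M^{-1}A and P = omega*(sum_{j<=k} C^j)*M^{-1},
    where M = diag(m_diag)."""
    n = len(A)
    I = identity(n)
    C = [[I[i][j] - omega * A[i][j] / m_diag[i] for j in range(n)] for i in range(n)]
    total = identity(n)                 # running sum of powers, starts at C^0
    power = identity(n)
    for _ in range(k):
        power = matmul(power, C)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    P = [[omega * total[i][j] / m_diag[j] for j in range(n)] for i in range(n)]
    return P, C

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
m_diag = [4.0, 4.0, 4.0]               # M = D_A
omega, k = 1.0, 3
P, C = neumann_preconditioner(A, m_diag, omega, k)

PA = matmul(P, A)
lhs = [[(1.0 if i == j else 0.0) - PA[i][j] for j in range(3)] for i in range(3)]
rhs = identity(3)
for _ in range(k + 1):
    rhs = matmul(rhs, C)               # rhs = C^{k+1}
```

Since the entries of $C^{k+1}$ shrink as k grows when $\rho(C) < 1$, the comparison also shows how quickly the preconditioned matrix approaches the identity for this choice of M.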


9.5.2 Schur Complement Approach and Deflation 


Many contemporary preconditioners are constructed hierarchically. A straightfor- 
ward example is represented by the approximate solution of saddle point problems 
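using the Schur complement approach. Consider the following general saddle point
system:

$$Ax = \begin{pmatrix} G & C \\ R & B \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = b. \tag{9.18}$$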

Assuming G is nonsingular, eliminating $x_1$ from the second block row yields the
reduced system

$$S x_2 = b_2 - R G^{-1} b_1, \tag{9.19}$$

where $S = B - R G^{-1} C$ is the Schur complement of G in A. Solving (9.19)
involves solving a linear system with G and with S. One option is to compute an
LU factorization of G and then employ a preconditioned iterative method; this is
outlined in Algorithm 9.2. Combining direct and iterative techniques is sometimes
referred to as a hybrid approach.

ALGORITHM 9.2 Simple Schur complement approach for saddle point systems
Input: Nonsingular saddle point system (9.18) with G nonsingular.
Output: Computed solution x.
1: Compute LU factorization of G
2: Solve $Gz = b_1$    ▷ Use LU factors
3: Compute $\widetilde{S}^{-1} \approx (B - R G^{-1} C)^{-1}$    ▷ $\widetilde{S}^{-1}$ chosen to approximate $S^{-1}$
4: Solve $S x_2 = b_2 - Rz$    ▷ Use iterative method with $M^{-1} = \widetilde{S}^{-1}$
5: Solve $G x_1 = b_1 - C x_2$    ▷ Use LU factors
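
The block elimination behind the Schur complement approach can be illustrated on a tiny dense example. In the sketch below, a dense Gaussian elimination routine `solve` stands in both for the LU factors of G and for the preconditioned iterative solve with S, and S is formed explicitly; in a realistic setting neither simplification would be used. The 3 × 3 saddle point system and all names are illustrative assumptions.

```python
def solve(M, rhs):
    """Dense Gaussian elimination with partial pivoting (stand-in solver)."""
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

# Blocks of the saddle point matrix [[G, C], [R, B]]; exact solution is (1, 2, 3).
G = [[4.0, 1.0], [1.0, 3.0]]
C = [[1.0], [2.0]]
R = [[2.0, 1.0]]
B = [[1.0]]
b1 = [9.0, 13.0]
b2 = [7.0]
n1, n2 = len(G), len(B)

# Solve G z = b1.
z = solve(G, b1)
# Form S = B - R G^{-1} C column by column (explicitly, for illustration only).
GinvC = [solve(G, [C[i][j] for i in range(n1)]) for j in range(n2)]
S = [[B[i][j] - sum(R[i][k] * GinvC[j][k] for k in range(n1)) for j in range(n2)]
     for i in range(n2)]
# Solve S x2 = b2 - R z.
x2 = solve(S, [b2[i] - sum(R[i][k] * z[k] for k in range(n1)) for i in range(n2)])
# Solve G x1 = b1 - C x2.
x1 = solve(G, [b1[i] - sum(C[i][j] * x2[j] for j in range(n2)) for i in range(n1)])
```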

The Schur complement (or substructuring) approach can be extended to matrices 
that are split into more blocks. Blocks may arise naturally from the underlying 
application, but they can also be defined using purely algebraic rules. For example, 
consider an SPD matrix A. Applying graph partitioning techniques (such as the 
nested dissection approach of Section 8.4) to the adjacency graph G(A), A can be 
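symmetrically permuted to the doubly bordered block diagonal (DBBD) form

$$A_{DB} = \begin{pmatrix} G_D & R^T \\ R & B \end{pmatrix},$$

where $G_D$ is an SPD block diagonal matrix (Section 8.5.1). $A_{DB}$ is a special case
of a symmetric saddle point matrix. A block $LDL^T$ factorization of $A_{DB}$ is given by

$$A_{DB} = \begin{pmatrix} I & \\ R G_D^{-1} & I \end{pmatrix} \begin{pmatrix} G_D & \\ & S \end{pmatrix} \begin{pmatrix} I & G_D^{-1} R^T \\ & I \end{pmatrix},$$

where the matrix $S = B - R G_D^{-1} R^T$ is the SPD Schur complement. The blocks within $G_D$ can be
factorized in parallel using a sparse Cholesky solver. However, S is typically large
and significantly denser than B and, in large-scale practical applications, it may not
be possible to explicitly assemble and factorize it; in this case, a preconditioned
iterative method is needed.

If $\widetilde{S}^{-1} \approx S^{-1}$, then an approximate block factorization of $A_{DB}^{-1}$ is

$$M^{-1} = \begin{pmatrix} I & -G_D^{-1} R^T \\ & I \end{pmatrix} \begin{pmatrix} G_D^{-1} & \\ & \widetilde{S}^{-1} \end{pmatrix} \begin{pmatrix} I & \\ -R G_D^{-1} & I \end{pmatrix}.$$

Employing $M^{-1}$ as a preconditioner for $A_{DB}$ gives the preconditioned matrix

$$M^{-1} A_{DB} = \begin{pmatrix} I & G_D^{-1} R^T (I - \widetilde{S}^{-1} S) \\ & \widetilde{S}^{-1} S \end{pmatrix}.$$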

Applying $M^{-1}$ requires the efficient solution of linear systems with $\widetilde{S}$ and
$G_D$. As in other preconditioning approaches, bounding the condition number of
the preconditioned matrix may be a useful indicator of the expected convergence
of CG. The eigenvalues of $M^{-1}A_{DB}$ are those of $\widetilde{S}^{-1}S$ and unity. Note that the
spectrum of $M^{-1}A_{DB}$ is the same as the spectrum of $M^{-1/2} A_{DB} M^{-1/2}$. Thus,
$\kappa(M^{-1/2} A_{DB} M^{-1/2})$ depends on the extremal eigenvalues of $\widetilde{S}^{-1}S$. A one-level
preconditioner for S is obtained by setting

$$\widetilde{S} = B.$$

Let the matrix B be m × m and let $\lambda_1 \ge \cdots \ge \lambda_m > 0$ be the eigenvalues of the
generalized eigenvalue problem

$$Sz = \lambda \widetilde{S} z.$$

Because $\widetilde{S}^{-1} S = I - B^{-1} R G_D^{-1} R^T$, it follows that $\lambda_1 \le 1$ and so

$$\kappa(\widetilde{S}^{-1/2} S \widetilde{S}^{-1/2}) = \frac{\lambda_1}{\lambda_m} \le \frac{1}{\lambda_m},$$

which is unbounded as $\lambda_m$ approaches zero. In general, one-level algebraic
preconditioners successfully bound the largest eigenvalues of the preconditioned matrix
but encounter difficulties in controlling the smallest ones, which can lie close to the 
origin, hindering convergence. Strategies that involve a second-level component aim 
to overcome this and include deflation preconditioners and domain decomposition 
preconditioners. 

The basic idea behind deflation is to “hide” parts of the spectrum from the 
CG method such that the CG iteration “sees” a system that has a much smaller 
condition number and hopefully a more favourable eigenvalue distribution than 
the original matrix. The part of the spectrum that is hidden is determined by the 
deflation subspace and the improvement in the convergence rate of the deflated CG 
method depends on the choice of this subspace. The ideal deflation subspace is 
the invariant subspace spanned by the eigenvectors corresponding to the smallest 
eigenvalues. There are practical cases showing that the convergence of the preconditioned
iterative method may profit from this restriction of the spectrum to its “effective”
part. To illustrate the approach, let $\Lambda$ be the k × k diagonal matrix with entries
equal to the k smallest eigenvalues and let Z be the m × k matrix whose columns
are the corresponding eigenvectors. A two-level deflation preconditioner is defined
to be

$$\widetilde{S}_{\mathrm{defl}}^{-1} = B^{-1} + Z(\Lambda^{-1} - I)Z^T = \widetilde{S}^{-1} + Z(\Lambda^{-1} - I)Z^T.$$

In practice, challenges remain because $\Lambda$ and Z are typically not readily available.


9.5.3 Domain Decomposition


In the last section, the vertices V = {1,2,...,n} of G(A) were partitioned into 
non-overlapping subsets. Alternatively, overlapping subsets (which are generally 
termed subdomains because the approach was originally proposed for problems that 
had an underlying grid) may be used. Domain decomposition methods based on 
overlapping subdomains are often referred to as Schwarz methods. Given N > 1,
let $\Omega_{\Gamma i}$ be the subset of size $n_{\Gamma i}$ of vertices that are distance one in G(A) from
the vertices in the non-overlapping subset $\Omega_{Ii}$ of size $n_{Ii}$ ($1 \le i \le N$). The overlapping
subdomain $\Omega_i$ is defined to be $\Omega_i = [\Omega_{Ii}, \Omega_{\Gamma i}]$, with size $n_i = n_{Ii} + n_{\Gamma i}$.

Associate with $\Omega_i$ an $n_i \times n$ restriction (or projection) matrix given by
$R_i = I_n(\Omega_i, :)$. $R_i$ maps from the global domain to subdomain $\Omega_i$; its transpose $R_i^T$ is a
prolongation matrix that maps from subdomain $\Omega_i$ to the global domain. The
one-level additive Schwarz preconditioner is defined to be

$$M_{AS}^{-1} = \sum_{i=1}^{N} R_i^T A_i^{-1} R_i, \qquad A_i = R_i A R_i^T. \tag{9.20}$$


Applying this preconditioner to a vector involves solving concurrent local problems 
in the overlapping subdomains. Increasing N reduces the sizes $n_i$ of the overlapping
subdomains, leading to smaller local problems and faster computations. However,
the preconditioned system using $M_{AS}^{-1}$ may not be well conditioned and the
convergence of the iterative solver may be inhibited. In fact, the local nature of
this preconditioner can lead to a deterioration in its effectiveness as the number of
subdomains increases because of the lack of global information from the matrix
A. To maintain robustness with respect to N, an artificial subdomain is added to the
preconditioner (also known as second-level or coarse space correction) that includes
global information. Let $0 < n_0 \ll n$. If the $n_0 \times n$ matrix $R_0$ is of full row rank, the
two-level additive Schwarz preconditioner is defined to be

$$M_{AS2}^{-1} = M_{AS}^{-1} + R_0^T A_0^{-1} R_0, \qquad A_0 = R_0 A R_0^T.$$


The coarse space correction can also be applied in a multiplicative way, which can 
lead to more robust variants. A sparse direct method can be used for the solves
with each $A_i$, which has the advantage of being robust and is another example of a
hybrid approach. Alternatively, for very large systems, incomplete Cholesky (IC) factorization
preconditioners or approximate inverse preconditioners and an iterative method can
be used. While this may result in a slower convergence rate, it can lead to a faster 
method overall because each iteration is less expensive (and may be the only option 
if the direct solver requires too much memory). Generalizing the approach to a 
hierarchy of additions of artificial domains leads to the class of multilevel methods. 
Again, employing them as preconditioners requires solves with the domain matrices, 
which can be based on sparse direct methods or preconditioned iterative methods. 

An attractive feature of domain decomposition methods is that they are naturally
parallel because all subdomain computations can be performed simultaneously. The 
restricted additive Schwarz preconditioner is obtained by a simple and efficient 
change that removes the overlap in the prolongation, replacing (9.20) by 


$$M_{RAS}^{-1} = \sum_{i=1}^{N} \tilde{R}_i^T A_i^{-1} R_i,$$


where $\tilde{R}_i = I_n(\Omega_{Ii}, :)$. The main motivation here is to reduce the communication
cost by half because computing products such as $\tilde{R}_i^T w$ does not involve any data
exchange with neighbouring processors.
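
As an illustration only, the following pure-Python sketch applies the one-level additive Schwarz preconditioner, and its restricted variant, to a vector. It works on small dense matrices stored as lists of lists. The function names `solve`, `apply_schwarz`, and `laplacian_1d` are hypothetical, the overlap is taken to be the vertices at distance one from each non-overlapping subset, and the restricted variant is modelled by simply discarding the overlap part of each local solution, which is an assumption about how the restricted prolongation acts.

```python
def solve(A, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def apply_schwarz(A, interiors, v, restricted=False):
    """Return z = M^{-1} v for one-level additive Schwarz.

    `interiors` lists the non-overlapping vertex subsets. With
    restricted=True only the non-overlapping part of each local
    solution is prolonged (restricted additive Schwarz).
    """
    n = len(A)
    z = [0.0] * n
    for interior in interiors:
        inner = set(interior)
        # Overlap: vertices at distance one in G(A) from the interior.
        overlap = sorted({j for i in interior for j in range(n)
                          if A[i][j] != 0.0 and j not in inner})
        dom = list(interior) + overlap               # overlapping subdomain
        Ai = [[A[p][q] for q in dom] for p in dom]   # A_i = R_i A R_i^T
        yi = solve(Ai, [v[p] for p in dom])          # A_i^{-1} R_i v
        for p, val in zip(dom, yi):
            if not restricted or p in inner:
                z[p] += val                          # prolongation
    return z


def laplacian_1d(n):
    """Tridiagonal (-1, 2, -1) matrix as a dense list of lists."""
    return [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]
```

With a single subdomain covering all vertices, both variants reduce to an exact solve with $A$. With several subdomains, the additive variant is symmetric whenever $A$ is, whereas the restricted variant in general is not.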


9.6 Notes and References 


A useful textbook on iterative methods is Saad (2003b). It includes the result stated 
in Theorem 9.3, while the proof of Theorem 9.2 is given in Mendelsohn (1956). 
Other key books include Meurant (1999), van der Vorst (2003), and the recent 
monograph of Bertaccini & Durastante (2018), as well as Liesen & Strakoš (2013) 
and Meurant & Duintjer Tebbens (2020), which targets theoretical and practical 
properties of iterative methods. The excellent surveys of Benzi (2002), Wathen
(2015), and Pearson & Pestana (2020) present overviews of preconditioning techniques
and the monograph Chen (2005) describes several approaches and includes many 
example applications, while Bollhöfer (2015) gives a practically oriented survey 
that mainly targets multilevel and parallel aspects of algebraic preconditioners. A 
discussion of the desirable properties of preconditioners can be found in Chow & 
Saad (1997). More sophisticated dropping strategies and the relation between ILU 
factorizations and factorized approximate inverses are considered by Bollhöfer &
Saad (2002, 2006), while Kopal et al. (2016) discuss adaptive dropping.

For a basic introduction to the stability problems of LU-based preconditioners, 
see Elman (1986, 1989). The Eisenstat trick of Section 9.2.3 is presented by 
Eisenstat (1981). An interesting discussion putting this into the context of other 
similar ideas is given in Ortega (1988a). 

The issue of potential breakdown during incomplete factorizations was pointed 
out by Kershaw (1978). This strengthened interest in classes of matrices for which 
breakdown cannot occur. Theorem 9.4 for M-matrices is from Meijerink & van der 
Vorst (1977); the extension to H-matrices is given independently by Manteuffel 
(1980) and Varga et al. (1980). Favourable asymptotic bounds for the condition 
number of M-matrices preconditioned by modified incomplete factorizations were 
an important impetus behind the development of algebraic preconditioners. These 
are described in Axelsson (1972) and Gustafsson (1978, 1979), but see also the 
early sophisticated analysis of relaxation methods presented in Dupont et al. (1968). 
Some of the assumptions that were used to obtain early asymptotic bounds were 
later shown to be unnecessary (Bern et al., 2006). Practical choices of polynomial 
preconditioners, particularly for SPD systems, are discussed in the book by Saad
(2003b) (and the earlier introductory paper of Saad, 1985). Note the recent interest 
of Loe & Morgan (2021) and Ye et al. (2021), the former motivated by the potential 
to reduce communication in parallel computing. 

For preconditioning saddle point problems using algebraic approaches, the 
highly cited survey of Benzi et al. (2005) and monograph of Rozložník (2018) are
good starting points. We also refer to the papers by Maryška et al. (1996, 2000a,b)
and Arioli et al. (2006) on the iterative solution of algebraically preconditioned 
saddle point problems from PDE applications. 

There are a number of monographs on domain decomposition methods. An 
important algorithmically oriented introduction is Smith et al. (1996), but see 
also Quarteroni & Valli (1999) and Toselli & Widlund (2005) as well as the 
books by Olshanskii & Tyrtyshnikov (2014) and Dolean et al. (2015), which 
emphasize connections to PDEs and solution techniques motivated by them. We 
recommend the paper of Tang et al. (2009) for an algebraic comparison of different 
classes of domain decomposition and deflation preconditioners. A further line 
of research resulting in general algebraic preconditioners has been developed 
using hierarchical matrices; the papers include Bebendorf & Fischer (2008) and 
Bebendorf et al. (2013) and the monograph on hierarchical matrices of Bebendorf 
(2008). The ShyLU software package developed by Rajamanickam et al. (2012) 
is a fully algebraic hybrid package for solving sparse linear systems using domain 
decomposition methods. It offers distributed memory domain decomposition solvers 
and node level solvers and kernels that support the distributed memory solvers. The 
node level solvers include sparse LU and Cholesky factorizations, a multithreaded 
triangular solver, and a fast iterative ILU algorithm. ShyLU is available as part of 
Trilinos (ShyLU Project Team, 2022). 

Algebraic multigrid (AMG) methods are another important class of frequently 
used methods. AMG methods can be used to precondition a wide spectrum of 
problems, but their development has been mainly motivated by systems arising 
from the discretization of PDEs, often exploiting specific properties of discretized 
models. A recommended overview is by Xu & Zikatanov (2017); see also Stüben
et al. (2017). 


Chapter 10
Incomplete Factorizations


They [incomplete factorizations] can be thought of as 
approximating the exact LU factorization of a given matrix A 
(e.g. computed via Gaussian elimination) by disallowing certain 
fill-ins. As opposed to other PDE-based preconditioners such as 
multigrid and domain decomposition, this class of 
preconditioners are primarily algebraic in nature and can in 
principle be applied to any sparse matrices. When applied to 
PDE problems, they are usually not optimal ... On the other 
hand, they are often quite robust. — Chan & van der Vorst 
(1997). 


Having introduced incomplete factorization preconditioners in the previous chapter, 
the focus in this chapter is on different ways to compute such factorizations and their 
relationship to the complete factorizations used in sparse direct methods. We denote 
the incomplete factors by $\tilde{L}$ and $\tilde{U}$; in the SPD case, $\tilde{U} = \tilde{L}^T$. We assume that the
sparsity patterns of A and its incomplete factors always include the positions of the 
diagonal entries. 


10.1 ILU(0) Factorization 


The simplest sparsity pattern for an incomplete factorization is $\mathcal{S}\{\tilde{L} + \tilde{U}\} = \mathcal{S}\{A\}$,
that is, no entries in $\tilde{L}$ or $\tilde{U}$ are allowed outside the sparsity pattern of $A$ and
only entries in positions $(i, j) \in \mathcal{S}\{A\}$ are retained in the (incomplete) elimination
matrices. The resulting incomplete factorization is called an ILU(0) factorization (or
an IC(0) factorization if $A$ is SPD).

Motivation for considering a sparsity pattern that is a superset of S{A} is given 
by the following straightforward but important result. 


Theorem 10.1 (Chan & van der Vorst 1997; van der Vorst 2003) Consider the
incomplete LU factorization $A + E = \tilde{L}\tilde{U}$ with sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$. The
entries of the error matrix $E$ are zero at positions $(i, j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}$.

Proof The result clearly holds for $j = 1$. Let $(i, j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}$ and assume without
loss of generality that $i > j > 1$. The $(i, j)$ entry of $\tilde{L}$ is computed as


$$\tilde{l}_{ij} = \Bigl( a_{ij} - \sum_{k=1}^{j-1} \tilde{l}_{ik} \tilde{u}_{kj} \Bigr) / \tilde{u}_{jj},$$


with the sum over $k$ implying $(i, k) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}$ and $(k, j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}$. This gives

$$a_{ij} = \tilde{L}_{i,1:j-1} \tilde{U}_{1:j-1,j} + \tilde{l}_{ij} \tilde{u}_{jj} = \tilde{L}_{i,1:j} \tilde{U}_{1:j,j} = \tilde{L}_{i,1:n} \tilde{U}_{1:n,j},$$


and the corresponding entry of $E$ is zero. □


A consequence of Theorem 10.1 is that extending $\mathcal{S}\{\tilde{L} + \tilde{U}\}$ gives a larger set of
entries of $A$ for which the error is zero. This is attractive provided the incomplete
factorization can still be computed and employed cheaply and does not require
prohibitive amounts of memory. In some situations, there are straightforward ways
to extend $\mathcal{S}\{\tilde{L} + \tilde{U}\}$. For example, consider a simple discretization of a PDE on a
rectangular grid. The sparsity pattern of the corresponding SPD matrix $A$ and its
graph $G(A)$ together with the first three steps of the Cholesky factorization of $A$
(in which variables 1, 2, and 3 are eliminated in turn) are given in Figure 10.1. $A$
has entries on the diagonal and four of its subdiagonals and the fill-in lies within
band($A$). A natural choice is to allow $\mathcal{S}\{\tilde{L} + \tilde{U}\}$ to include fill-in along a few
additional diagonals within the band.

Figure 10.1 An $8 \times 8$ banded sparse SPD matrix $A$ and its graph $G(A)$. The first three steps of a
Cholesky factorization are shown. Filled entries are denoted by f.


10.2 Basic Incomplete Factorizations


We start with the two basic incomplete factorizations. Here and elsewhere, section
notation is used but operations are performed only on nonzero entries. The Crout
variant given in Algorithm 10.1 computes $\tilde{U}$ row-by-row and $\tilde{L}$ column-by-column
and sparsifies each row and column as soon as they are computed using a target
sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$. The widely used variant outlined in Algorithm 10.2
constructs both $\tilde{L}$ and $\tilde{U}$ by rows. Prescribing an appropriate sparsity pattern in
advance can be difficult. If it is not supplied, sparsification can be applied inside
the k loops (for instance, entries with absolute value less than a chosen tolerance
may be dropped) and the sparsity patterns of the factors updated as the factorization
proceeds.

Algorithms 10.1 and 10.2 are straightforward to implement using sparse data
structures. At major step $i$, Algorithm 10.2 computes $\tilde{L}_{i,1:i-1}$ and $\tilde{U}_{i,i:n}$; both
rows can be held using a single auxiliary vector. Note that, in Algorithm 10.1,
sparsification of the partially computed vectors is performed outside the k loops,
whereas in Algorithm 10.2 it is inside the k loop. In practice, either approach can be
used, leading to slightly different variants.


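ALGORITHM 10.1 Crout incomplete LU factorization
Input: Matrix $A$ and, optionally, a target sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$.
Output: Incomplete LU factorization $A \approx \tilde{L}\tilde{U}$.

 1: for j = 1 : n do
 2:     $\tilde{l}_{jj} = 1$, $\tilde{L}_{j+1:n,j} = A_{j+1:n,j}$
 3:     $\tilde{U}_{j,j:n} = A_{j,j:n}$
 4:     for k = 1 : j − 1 such that $(j, k) \in \mathcal{S}\{\tilde{L}\}$ do
 5:         $\tilde{U}_{j,j:n} = \tilde{U}_{j,j:n} - \tilde{l}_{jk} \tilde{U}_{k,j:n}$        ▷ Sparse linear combination
 6:     end for
 7:     Sparsify $\tilde{U}_{j,j+1:n}$        ▷ Drop entries from row j of $\tilde{U}$
 8:     for k = 1 : j − 1 such that $(k, j) \in \mathcal{S}\{\tilde{U}\}$ do
 9:         $\tilde{L}_{j+1:n,j} = \tilde{L}_{j+1:n,j} - \tilde{u}_{kj} \tilde{L}_{j+1:n,k}$        ▷ Sparse linear combination
10:     end for
11:     Sparsify $\tilde{L}_{j+1:n,j}$        ▷ Drop entries from column j of $\tilde{L}$
12:     $\tilde{L}_{j+1:n,j} = \tilde{L}_{j+1:n,j} / \tilde{u}_{jj}$
13: end for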
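
ALGORITHM 10.2 Row incomplete LU factorization
Input: Matrix $A$ and, optionally, a target sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$.
Output: Incomplete LU factorization $A \approx \tilde{L}\tilde{U}$.

 1: for i = 1 : n do
 2:     $\tilde{l}_{ii} = 1$, $\tilde{L}_{i,1:i-1} = A_{i,1:i-1}$
 3:     $\tilde{U}_{i,i:n} = A_{i,i:n}$
 4:     Sparsify $\tilde{L}_{i,1:i-1}$ and $\tilde{U}_{i,i:n}$
 5:     for k = 1 : i − 1 such that $(i, k) \in \mathcal{S}\{\tilde{L}\}$ do
 6:         $\tilde{l}_{ik} = \tilde{l}_{ik} / \tilde{u}_{kk}$
 7:         $\tilde{L}_{i,k+1:i-1} = \tilde{L}_{i,k+1:i-1} - \tilde{l}_{ik} \tilde{U}_{k,k+1:i-1}$
 8:         Sparsify $\tilde{L}_{i,k+1:i-1}$
 9:         $\tilde{U}_{i,i:n} = \tilde{U}_{i,i:n} - \tilde{l}_{ik} \tilde{U}_{k,i:n}$
10:         Sparsify $\tilde{U}_{i,i:n}$
11:     end for
12: end for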
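
The row variant can be sketched in a few lines of pure Python. The sketch below is illustrative only: it stores matrices as dense lists of lists (a practical code would use sparse storage), takes the target pattern as a set of 0-based index pairs, and uses hypothetical function names (`ilu_pattern`, `error_matrix`, `laplacian_2d`). It also lets the error property of Theorem 10.1 be checked numerically on a small 5-point Laplacian.

```python
def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid, natural ordering, dense lists."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.0
            if c < m - 1:
                A[i][i + 1] = -1.0
            if r > 0:
                A[i][i - m] = -1.0
            if r < m - 1:
                A[i][i + m] = -1.0
    return A


def ilu_pattern(A, pattern):
    """Row-wise incomplete LU restricted to `pattern` (set of (i, j) pairs).

    Returns dense L (unit lower triangular) and U (upper triangular).
    The pattern is assumed to contain all diagonal positions.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Working copy of row i, sparsified to the target pattern.
        row = [A[i][j] if (i, j) in pattern else 0.0 for j in range(n)]
        for k in range(i):
            if (i, k) not in pattern:
                continue
            row[k] /= U[k][k]                    # multiplier l_ik
            for j in range(k + 1, n):
                if (i, j) in pattern:            # updates outside the pattern are dropped
                    row[j] -= row[k] * U[k][j]
        L[i][:i] = row[:i]
        L[i][i] = 1.0
        U[i][i:] = row[i:]
    return L, U


def error_matrix(A, L, U):
    """Return E = L U - A as a dense list of lists."""
    n = len(A)
    return [[sum(L[i][k] * U[k][j] for k in range(n)) - A[i][j]
             for j in range(n)] for i in range(n)]
```

Calling `ilu_pattern` with the pattern of $A$ gives an ILU(0) factorization; the entries of `error_matrix` at positions in the pattern are then zero up to rounding, while positions where fill-in was discarded carry the error.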


10.3 Incomplete Factorizations Based on the Shortest 
Fill-Paths 


We next consider an incomplete LU factorization that uses a structure-based
dropping strategy. Entries of the factors that correspond to nonzero entries of $A$ are
assigned the level 0, while each potential filled entry in position $(i, j)$ is assigned a
level as follows:


$$\mathrm{level}(i, j) = \min_{1 \le k < \min\{i, j\}} \bigl( \mathrm{level}(i, k) + \mathrm{level}(k, j) + 1 \bigr). \tag{10.1}$$


Given $\ell \ge 0$, during the factorization, a filled entry is permitted at position
$(i, j)$ provided $\mathrm{level}(i, j) \le \ell$. The resulting level-based incomplete factorization
is denoted by ILU($\ell$) (or IC($\ell$)); the basic row variant is given in Algorithm 10.3.

Figure 10.2 depicts $\mathcal{S}\{\tilde{L} + \tilde{L}^T\}$ for the IC($\ell$) factorization of $A$ from the
discretized Laplace equation on a square grid (see the smaller problem in (9.14))
and for a matrix with a more general symmetric sparsity structure. The fill-in is
typically generated irregularly throughout the factorization: initially few updates
are needed, but later steps involve many updates, leading to large amounts of
dropping. Furthermore, the amount of fill-in can grow quickly with increasing $\ell$
and, as a result, $\ell$ is typically small and level-based dropping is often combined with
threshold-based dropping or with sparsifying $A$ before the factorization commences
(for example, by discarding entries of $A$ with small absolute values).

ALGORITHM 10.3 Level-based incomplete LU factorization
Input: Matrix $A$ and the level parameter $\ell \ge 0$.
Output: ILU($\ell$) factorization $A \approx \tilde{L}\tilde{U}$.

 1: Initialise level to 0 for nonzeros and diagonal entries of A and to n + 1 otherwise
 2: for i = 1 : n do        ▷ Loop over rows
 3:     $\tilde{l}_{ii} = 1$, $\tilde{L}_{i,1:i-1} = A_{i,1:i-1}$ and $\tilde{U}_{i,i:n} = A_{i,i:n}$        ▷ Initialise row i of $\tilde{L}$ and $\tilde{U}$
 4:     for k = 1 : i − 1 such that $\mathrm{level}(i, k) \le \ell$ do
 5:         $\tilde{l}_{ik} = \tilde{l}_{ik} / \tilde{u}_{kk}$
 6:         for j = k + 1 : i − 1 do
 7:             $\tilde{l}_{ij} = \tilde{l}_{ij} - \tilde{l}_{ik} \tilde{u}_{kj}$ and update level(i, j)
 8:         end for
 9:         for j = i : n do
10:             $\tilde{u}_{ij} = \tilde{u}_{ij} - \tilde{l}_{ik} \tilde{u}_{kj}$ and update level(i, j)
11:         end for
12:     end for
13:     for k = 1 : i − 1 do        ▷ Drop entries in row i for which level is too high
14:         if $\mathrm{level}(i, k) > \ell$ then $\tilde{l}_{ik} = 0$
15:     end for
16:     for k = i : n do
17:         if $\mathrm{level}(i, k) > \ell$ then $\tilde{u}_{ik} = 0$
18:     end for
19: end for


The level-based strategy comes from observing that in practical examples the 
absolute values of the entries in the factors in positions for which level is large are 
often small. This is the case for model problems arising from discretized PDEs. A 
closer look shows a surprising connection between the level-based ILU factorization 
and the complete factorization: entries with large values of level correspond to long 
fill-paths. This is expressed in Theorem 10.2, which allows the sparsity patterns of 
the incomplete factors to be determined a priori. 


Theorem 10.2 (Hysom & Pothen 2002) Consider the ILU($\ell$) factorization of $A$.
$\mathrm{level}(i, j) = k$ for some $k \le \ell$ if and only if there is a shortest fill-path $i \Rightarrow j$ of
length $k + 1$ in the adjacency graph $G(A)$.


Figure 10.2 The sparsity patterns of the IC($\ell$) factors of $A$ from the discretized Laplace equation
on a square grid (top) and a more general symmetric sparse matrix (bottom).


Algorithm 10.4 outlines finding the pattern of row $i$ of $\tilde{U}$; finding the pattern of
columns of $\tilde{L}$ is analogous. Only $G(A)$ is required, and hence the sparsity pattern
of each row in the factor can be computed independently, in parallel. The algorithm
operates via a simple breadth-first search that finds a shortest path between vertex
$i$ and vertices reachable from $i$ via a graph traversal of $\ell + 1$ or fewer edges. The
correctness of the algorithm follows from Theorem 10.2.
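
The equivalence between levels and shortest fill-paths can be explored with a small pure-Python sketch. It assumes a symmetric sparsity structure given as adjacency lists with 0-based vertex numbers; the names `grid_adjacency`, `levels_by_elimination`, and `row_pattern_bfs` are hypothetical. The first routine propagates levels row by row using (10.1), the second follows the breadth-first search idea for a single row of the upper factor.

```python
from collections import deque


def grid_adjacency(m):
    """Adjacency lists of the 5-point stencil graph on an m-by-m grid."""
    adj = [[] for _ in range(m * m)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            if c + 1 < m:
                adj[i].append(i + 1)
                adj[i + 1].append(i)
            if r + 1 < m:
                adj[i].append(i + m)
                adj[i + m].append(i)
    return adj


def levels_by_elimination(adj, lev):
    """Levels from (10.1), row by row; values above `lev` mean 'dropped'."""
    n = len(adj)
    level = [[n + 1] * n for _ in range(n)]
    for i in range(n):
        level[i][i] = 0
        for j in adj[i]:
            level[i][j] = 0
    for i in range(n):
        for k in range(i):
            if level[i][k] > lev:
                continue                      # entry (i, k) was dropped
            for j in range(k + 1, n):
                if level[k][j] <= lev:        # only retained entries of row k
                    cand = level[i][k] + level[k][j] + 1
                    level[i][j] = min(level[i][j], cand)
    return level


def row_pattern_bfs(adj, i, lev):
    """Pattern of row i of the ILU(lev) upper factor via breadth-first search."""
    pattern = {i}
    length = {i: 0}
    visited = {i}
    queue = deque([i])
    while queue:
        k = queue.popleft()
        for j in adj[k]:
            if j in visited:
                continue
            visited.add(j)
            if j < i and length[k] < lev:
                queue.append(j)               # j may be an interior path vertex
                length[j] = length[k] + 1
            elif j > i:
                pattern.add(j)                # reached by a short enough fill-path
    return pattern
```

For each row $i$, the set returned by `row_pattern_bfs` can be compared with the positions $j \ge i$ for which `levels_by_elimination` reports a level of at most `lev`. Because the search for row $i$ needs only the graph, different rows could be processed concurrently.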


10.4 Modifications Based on Maintaining Row Sums 


We assume in this section that the target sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$ contains $\mathcal{S}\{A\}$.
Modified incomplete factorizations (MILU or MIC in the SPD case) seek to
maintain equality between the row sums of $A$ and $\tilde{L}\tilde{U}$, that is, $\tilde{L}\tilde{U}e = Ae$ ($e$ is the
vector of all ones). Rather than discarding potential fill-in outside the target sparsity
pattern, the approach subtracts it from the diagonal entries of $\tilde{U}$; this is outlined
in Algorithm 10.5. Note that an MILU factorization may break down. If the target
sparsity pattern corresponds to that of an ILU($\ell$) factorization, then an MILU($\ell$)
factorization is computed.

Equality of the row sums of $A$ and $\tilde{L}\tilde{U}$ can be seen as follows. If all the filled
entries are retained (that is, $\mathcal{S}\{\tilde{L} + \tilde{U}\} = \mathcal{S}\{L + U\}$), then the claim holds trivially.
Now assume some filled entries are not kept. If an entry in column $j$ of row $i$ of $A$
belongs to the target sparsity pattern, then its value is modified in Step 8 if $i \le j$
or in Step 15 if $i > j$. Otherwise, the $i$-th diagonal entry of $\tilde{U}$ is modified (Step
10 or Step 17). In each case, $\tilde{l}_{ik}\tilde{u}_{kj}$ is subtracted from entries of the $i$-th row of the
incomplete factors. Consider the sum of the entries in row $i$ of $\tilde{L}\tilde{U}$. It is given by

ALGORITHM 10.4 Find the sparsity pattern of row i of the ILU($\ell$) factor $\tilde{U}$ of $A$
Input: Graph $G(A)$, the level parameter $\ell \ge 0$ and row index $i$.
Output: Sparsity pattern $\mathcal{S}\{\tilde{U}_{i,i:n}\}$ of row $i$ of the ILU($\ell$) factorization $A \approx \tilde{L}\tilde{U}$.

 1: $\mathcal{S}\{\tilde{U}_{i,i:n}\} = \{i\}$, $Q = \{i\}$        ▷ Queue holds i initially
 2: length(i) = 0
 3: visited(i) = i
 4: while Q is not empty do
 5:     pop(Q, k)        ▷ Take k from the queue
 6:     for $j \in \mathrm{adj}_{G(A)}(k)$ with visited(j) ≠ i do
 7:         visited(j) = i
 8:         if j < i and length(k) < $\ell$ then
 9:             append(Q, j)        ▷ Add j to the queue
10:             length(j) = length(k) + 1
11:         else if j > i then
12:             $\mathcal{S}\{\tilde{U}_{i,i:n}\} = \mathcal{S}\{\tilde{U}_{i,i:n}\} \cup \{j\}$        ▷ Add j to the sparsity pattern of row i
13:         end if
14:     end for
15: end while


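$$(\tilde{L}\tilde{U}e)_i = \sum_{j=1}^{i-1} \Bigl( a_{ij} - \sum_{k=1}^{j-1} \tilde{l}_{ik}\tilde{u}_{kj} \Bigr) + \sum_{j=1}^{i-1} \sum_{k=j+1}^{n} \tilde{l}_{ij}\tilde{u}_{jk} + \sum_{k=i}^{n} \Bigl( a_{ik} - \sum_{j=1}^{i-1} \tilde{l}_{ij}\tilde{u}_{jk} \Bigr).$$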


Rearranging the indices in the double summations, the double sums cancel out.
Moreover, the added double summation is the sum of all the modification terms
$\tilde{l}_{ik}\tilde{u}_{kj}$ in Algorithm 10.5, and the sum of the two subtracted double summations
also comprises all the modification terms. Consequently, the row sums of $A$ are
preserved in the product of the incomplete factors.

Theorem 10.3 provides motivation for maintaining constant row sums in the case
of a model PDE problem. The result is also valid for Neumann or mixed boundary
conditions, and there are extensions to three-dimensional problems and MIC($\ell$)
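
ALGORITHM 10.5 Modified incomplete factorization (MILU)
Input: Matrix $A = L_A + D_A + U_A$ (see (9.6)) and a target sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$
containing $\mathcal{S}\{A\}$.
Output: Incomplete LU factorization $A \approx \tilde{L}\tilde{U}$.

 1: $\tilde{l}_{ij} = (I + L_A)_{ij}$ for all $(i, j) \in \mathcal{S}\{\tilde{L}\}$        ▷ $\mathcal{S}\{L_A\} \subseteq \mathcal{S}\{\tilde{L}\}$
 2: $\tilde{u}_{ij} = (D_A + U_A)_{ij}$ for all $(i, j) \in \mathcal{S}\{\tilde{U}\}$        ▷ $\mathcal{S}\{U_A\} \subseteq \mathcal{S}\{\tilde{U}\}$
 3: for k = 1 : n − 1 do
 4:     for i = k + 1 : n such that $(i, k) \in \mathcal{S}\{\tilde{L}\}$ do
 5:         $\tilde{l}_{ik} = \tilde{l}_{ik} / \tilde{u}_{kk}$        ▷ Check that $\tilde{u}_{kk}$ is nonzero
 6:         for j = i : n such that $(k, j) \in \mathcal{S}\{\tilde{U}\}$ do
 7:             if $(i, j) \in \mathcal{S}\{\tilde{U}\}$ then
 8:                 $\tilde{u}_{ij} = \tilde{u}_{ij} - \tilde{l}_{ik}\tilde{u}_{kj}$
 9:             else
10:                 $\tilde{u}_{ii} = \tilde{u}_{ii} - \tilde{l}_{ik}\tilde{u}_{kj}$        ▷ Modify diagonal instead of creating fill-in
11:             end if
12:         end for
13:         for j = k + 1 : i − 1 such that $(k, j) \in \mathcal{S}\{\tilde{U}\}$ do
14:             if $(i, j) \in \mathcal{S}\{\tilde{L}\}$ then
15:                 $\tilde{l}_{ij} = \tilde{l}_{ij} - \tilde{l}_{ik}\tilde{u}_{kj}$
16:             else
17:                 $\tilde{u}_{ii} = \tilde{u}_{ii} - \tilde{l}_{ik}\tilde{u}_{kj}$        ▷ Modify diagonal instead of creating fill-in
18:             end if
19:         end for
20:     end for
21: end for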


with $\ell > 0$. However, although Theorem 10.1 holds for MILU factorizations, the
approach may not be useful for general $A$.


Theorem 10.3 (Gustafsson 1978; Bern et al. 2006) Let $A$ come from a discretized
Poisson problem on a uniform two-dimensional rectangular grid with Dirichlet
boundary conditions and discretization parameter $h$. Then the condition number
$\kappa((\tilde{L}\tilde{U})^{-1}A)$ for the level-based MIC(0) preconditioner is $O(h^{-1})$.


Optionally, in Steps 10 and 17 of Algorithm 10.5, the update term $\tilde{l}_{ik}\tilde{u}_{kj}$ may be
multiplied by a parameter $\theta$ ($0 \le \theta \le 1$) before it is subtracted from the diagonal
entry $\tilde{u}_{ii}$. The introduction of $\theta$ was proposed as a practical way to extend MILU to
linear systems not coming from discretized PDEs. Clearly, using $\theta < 1$ reduces the
amount by which the diagonal entries are modified.
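
The row-sum property is easy to check numerically. The following pure-Python sketch of a modified incomplete factorization is illustrative only: it uses dense lists of lists, a pattern given as a set of 0-based index pairs that is assumed to contain the pattern of $A$ and the diagonal, and hypothetical names (`milu`, `row_sums_of_product`, `laplacian_2d`). The optional argument `theta` plays the role of the relaxation parameter; no safeguard against a zero pivot is included.

```python
def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid, natural ordering, dense lists."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.0
            if c < m - 1:
                A[i][i + 1] = -1.0
            if r > 0:
                A[i][i - m] = -1.0
            if r < m - 1:
                A[i][i + m] = -1.0
    return A


def milu(A, pattern, theta=1.0):
    """Modified incomplete LU: discarded fill is subtracted from the diagonal."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for (i, j) in pattern:
        if i > j:
            L[i][j] = A[i][j]
        else:
            U[i][j] = A[i][j]
    for i in range(n):
        L[i][i] = 1.0
    for k in range(n - 1):
        for i in range(k + 1, n):
            if (i, k) not in pattern:
                continue
            L[i][k] /= U[k][k]                 # multiplier; U[k][k] must be nonzero
            for j in range(k + 1, n):
                if (k, j) not in pattern:
                    continue
                t = L[i][k] * U[k][j]
                if (i, j) in pattern:
                    if j >= i:
                        U[i][j] -= t           # regular update in the upper part
                    else:
                        L[i][j] -= t           # regular update in the lower part
                else:
                    U[i][i] -= theta * t       # compensate on the diagonal
    return L, U


def row_sums_of_product(L, U):
    """Row sums of the product L U, computed as L (U e)."""
    n = len(L)
    Ue = [sum(U[k]) for k in range(n)]
    return [sum(L[i][k] * Ue[k] for k in range(n)) for i in range(n)]
```

With `theta=1.0` the row sums of the product agree with those of $A$ up to rounding; choosing `theta=0.0` gives back the unmodified incomplete factorization for the same pattern.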


10.5 Dynamic Compensation


As discussed in Section 9.4.1, dropping entries can lead to breakdown. One way to
avoid this (in exact arithmetic) is to dynamically modify the computed entries; this
is outlined as Algorithm 10.6. Instead of accepting a filled entry in position $(i, j)$,
the idea is to add a (weighted) multiple of its absolute value to the corresponding
diagonal entries $\tilde{u}_{ii}$ and $\tilde{u}_{jj}$. Provided the number of modifications is small, this
can be useful if $A$ is a diagonally dominant matrix and scaled so that its diagonal
entries are nonnegative. The parameter $\omega$ controls the amount by which the diagonal
entries of $\tilde{U}$ are modified, but if $\omega < 1$, then breakdown can still occur. Dynamic
compensation can be successful when incorporated into an IC factorization of

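ALGORITHM 10.6 ILU factorization with dynamic compensation
Input: Matrix $A = L_A + D_A + U_A$ (see (9.6)), a target sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$
and parameter $\omega$ ($0 \le \omega \le 1$).
Output: Incomplete LU factorization $A \approx \tilde{L}\tilde{U}$.

 1: $\tilde{l}_{ij} = (I + L_A)_{ij}$ for all $(i, j) \in \mathcal{S}\{\tilde{L}\}$
 2: $\tilde{u}_{ij} = (D_A + U_A)_{ij}$ for all $(i, j) \in \mathcal{S}\{\tilde{U}\}$
 3: for k = 1 : n − 1 do
 4:     for i = k + 1 : n such that $(i, k) \in \mathcal{S}\{\tilde{L}\}$ do
 5:         $\tilde{l}_{ik} = \tilde{l}_{ik} / \tilde{u}_{kk}$
 6:         for j = i : n such that $(k, j) \in \mathcal{S}\{\tilde{U}\}$ do
 7:             if $(i, j) \in \mathcal{S}\{\tilde{U}\}$ then
 8:                 $\tilde{u}_{ij} = \tilde{u}_{ij} - \tilde{l}_{ik}\tilde{u}_{kj}$
 9:             else
10:                 $\rho = (\tilde{u}_{ii} / \tilde{u}_{jj})^{1/2}$
11:                 $\tilde{u}_{ii} = \tilde{u}_{ii} + \omega\rho\,|\tilde{l}_{ik}\tilde{u}_{kj}|$, $\tilde{u}_{jj} = \tilde{u}_{jj} + \omega\,|\tilde{l}_{ik}\tilde{u}_{kj}| / \rho$, $\tilde{u}_{ij} = 0$
12:             end if
13:         end for
14:         for j = k + 1 : i − 1 such that $(k, j) \in \mathcal{S}\{\tilde{U}\}$ do
15:             if $(i, j) \in \mathcal{S}\{\tilde{L}\}$ then
16:                 $\tilde{l}_{ij} = \tilde{l}_{ij} - \tilde{l}_{ik}\tilde{u}_{kj}$
17:             else
18:                 $\rho = (\tilde{u}_{ii} / \tilde{u}_{jj})^{1/2}$
19:                 $\tilde{u}_{ii} = \tilde{u}_{ii} + \omega\rho\,|\tilde{l}_{ik}\tilde{u}_{kj}|$, $\tilde{u}_{jj} = \tilde{u}_{jj} + \omega\,|\tilde{l}_{ik}\tilde{u}_{kj}| / \rho$, $\tilde{l}_{ij} = 0$
20:             end if
21:         end for
22:     end for
23: end for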

an SPD matrix A because the resulting local modifications correspond to adding
positive semidefinite matrices to A. In practice, the behaviour of the resulting 
preconditioner can be very different from that computed using the MIC approach 
of the previous section. 

A related scheme, called diagonally compensated reduction, modifies A before 
the factorization begins by adding the values of all of its positive off-diagonal entries 
to the corresponding diagonal entries and then setting these off-diagonal entries 
to zero. If A is SPD, then the resulting matrix is a symmetric M-matrix and the 
incomplete factorization will not break down (Theorem 9.4). However, the modified 
matrix may be too far from A for its incomplete factors to be useful. 
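
Diagonally compensated reduction is simple enough to state in a few lines. The sketch below is a minimal illustration for a dense matrix stored as a list of lists; the function name `diagonally_compensated_reduction` is hypothetical.

```python
def diagonally_compensated_reduction(A):
    """Move every positive off-diagonal entry of A onto the diagonal of its row.

    Returns a new dense matrix; the input is left unchanged.
    """
    n = len(A)
    B = [row[:] for row in A]
    for i in range(n):
        for j in range(n):
            if i != j and B[i][j] > 0.0:
                B[i][i] += B[i][j]   # compensate on the diagonal
                B[i][j] = 0.0        # remove the positive off-diagonal entry
    return B


A = [[4.0, 1.0, -1.0],
     [1.0, 4.0, -2.0],
     [-1.0, -2.0, 4.0]]
B = diagonally_compensated_reduction(A)
# B == [[5.0, 0.0, -1.0], [0.0, 5.0, -2.0], [-1.0, -2.0, 4.0]]
```

In this small example the difference between the modified matrix and $A$ has the entries 1, −1, −1, 1 in the leading $2 \times 2$ block and zeros elsewhere, which is positive semidefinite; the size of this difference is what determines how far the modified matrix is from $A$.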


10.6 Memory-Limited Incomplete Factorizations 


We next consider a more sophisticated modification scheme that introduces the use
of intermediate memory that is employed during the construction of the incomplete
factors but is then discarded. The aim is to obtain a high quality preconditioner
while maintaining sparsity and allowing the user to control how much memory is
used (both in the construction of the preconditioner and in the incomplete factor $\tilde{L}$).
Let the matrix $A$ be SPD and consider the decomposition


$$A = (\tilde{L} + \tilde{R})(\tilde{L} + \tilde{R})^T - E.$$


Here the incomplete factor $\tilde{L}$ is a lower triangular matrix with positive diagonal
entries, $\tilde{R}$ is a strictly lower triangular matrix with “small” entries, and the error
matrix is $E = \tilde{R}\tilde{R}^T$. At each step, the next column of $\tilde{L}$ is computed, and then the
remaining Schur complement is modified. On step $j$ of the incomplete factorization,
the first column of the Schur complement $S^{(j)}$ is split into the sum


$$\tilde{L}_{j:n,j} + \tilde{R}_{j:n,j},$$


where $\tilde{L}_{j:n,j}$ contains the entries that are retained in column $j$ of the final incomplete
factorization, $(\tilde{R})_{jj} = 0$ and $\tilde{R}_{j+1:n,j}$ contains the entries that are discarded.
If a complete factorization was being computed, then the Schur complement would
be updated by subtracting


$$(\tilde{L}_{j+1:n,j} + \tilde{R}_{j+1:n,j})(\tilde{L}_{j+1:n,j} + \tilde{R}_{j+1:n,j})^T.$$

However, the incomplete factorization discards the term


$$E^{(j)} = \tilde{R}_{j+1:n,j}\tilde{R}_{j+1:n,j}^T.$$

Figure 10.3 An illustration of the fill-in in a standard sparsification-based IC factorization (left)
and in the approach that uses intermediate memory (right) after one step of the factorization. Entries
with a small absolute value in row and column 1 are denoted by $\delta$. The filled entries are denoted
by f.

Figure 10.4 On the left is an SPD matrix with an entry of small absolute value in positions (1, 3) and
(3, 1). In the centre is $\mathcal{S}\{\tilde{L}\}$ computed using a standard IC factorization that drops the small entry
$\delta$ at position (3, 1) (there are no filled entries in this case). On the right is the lower triangular
part of the elimination matrix after the first step of the incomplete factorization using intermediate
memory. The filled entry is denoted by f.


Thus, the matrix $E^{(j)}$ is implicitly added to $A$, and because $E^{(j)}$ is positive
semidefinite, the approach is naturally breakdown-free.

The obvious choice for $\tilde{R}_{j+1:n,j}$ is the smallest off-diagonal entries in the column
(those that are smaller in absolute value than a chosen tolerance). Then implicitly
adding $E^{(j)}$ is combined with the standard steps of an IC factorization, with entries
dropped from $\tilde{L}$ after the updates have been applied to the Schur complement.

Figure 10.3 depicts the first step of this approach. In the first row and column,
$\times$ and $\delta$ denote the entries of $\tilde{L}_{1:n,1}$ and $\tilde{R}_{1:n,1}$, respectively. Because a standard
sparsification scheme does not store the smallest entries, using such a scheme gives
no fill-in in the rows and columns corresponding to the discarded entries; this is
shown on the left. The fill-in in the factorization that uses intermediate memory is
depicted on the right. Clearly, more filled entries are used in constructing $\tilde{L}$.

This strategy enables the structure of the complete factorization to be followed
more closely than is possible using a standard approach. This is illustrated in
Figure 10.4. If the small entries at positions (1, 3) and (3, 1) are not discarded, then
there is a filled entry in position (3, 2) and this allows the incomplete factorization
using intermediate memory to involve the (large) off-diagonal entries in positions
(5, 2) and (6, 2) in the second step of the IC factorization.

Unfortunately, because the column $\tilde{R}_{j+1:n,j}$ must be retained to perform the
updates on the next step, the total memory requirements are essentially as for a

ALGORITHM 10.7 Crout memory-limited IC factorization
Input: SPD matrix $A$, memory control parameters $lsize \ge 0$ and $rsize \ge 0$.
Output: Incomplete Cholesky factorization $A \approx \tilde{L}\tilde{L}^T$.

 1: $w_i = 0$, $1 \le i \le n$
 2: for j = 1 : n do
 3:     for i = j : n such that $a_{ij} \ne 0$ do
 4:         $w_i = a_{ij}$
 5:     end for
 6:     for k < j such that $\tilde{l}_{jk} \ne 0$ do
 7:         for i = j : n such that $\tilde{l}_{ik} \ne 0$ do
 8:             $w_i = w_i - \tilde{l}_{ik}\tilde{l}_{jk}$
 9:         end for
10:         for i = j : n such that $\tilde{r}_{ik} \ne 0$ do
11:             $w_i = w_i - \tilde{r}_{ik}\tilde{l}_{jk}$
12:         end for
13:     end for
14:     for k < j such that $\tilde{r}_{jk} \ne 0$ do
15:         for i = j : n such that $\tilde{l}_{ik} \ne 0$ do
16:             $w_i = w_i - \tilde{l}_{ik}\tilde{r}_{jk}$
17:         end for
18:     end for
19:     Copy into $\tilde{L}_{j:n,j}$ the $lsize + nz(A_{j:n,j})$ entries of $w$ of largest absolute value
20:     Copy into $\tilde{R}_{j+1:n,j}$ the $rsize$ entries of $w$ that are the next largest in absolute value
21:     Scale $\tilde{l}_{jj} = (w_j)^{1/2}$, $\tilde{L}_{j+1:n,j} = \tilde{L}_{j+1:n,j} / \tilde{l}_{jj}$, $\tilde{R}_{j+1:n,j} = \tilde{R}_{j+1:n,j} / \tilde{l}_{jj}$
22:     Reset entries of $w$ to zero
23: end for
24: Optionally discard $\tilde{R}$        ▷ $\tilde{R}$ is often only used in the construction of $\tilde{L}$


complete factorization. However, the memory can be reduced by introducing two
drop tolerances so that only entries of absolute value at least $\tau_1$ are kept in $\tilde{L}$
and entries smaller than $\tau_2$ are dropped from $\tilde{R}$. The factorization is no longer
guaranteed to be breakdown-free. Furthermore, the numbers of entries in $\tilde{L}$ and
$\tilde{R}$ are not known a priori.

An alternative idea that limits both the number of entries in the incomplete
factor and the intermediate memory is to fix the maximum number of entries in
each column of $\tilde{L}$ and $\tilde{R}$. This is outlined in Algorithm 10.7. Here $lsize \ge 0$ and
$rsize \ge 0$ are the maximum number of filled entries in each column of $\tilde{L}$ and
the maximum number of entries in each column of $\tilde{R}$, respectively, and $nz(A_{j:n,j})$
denotes the number of entries in the lower triangular part of column $j$ of $A$. The
number of entries in $\tilde{L}$ is less than $nz(A) + (n - 1)\,lsize$ (where $nz(A)$ is the number
of entries in the lower triangular part of $A$) and $\tilde{R}$ has at most $(n - 1)\,rsize$ entries. If
the parameter $rsize$ is set to 0, then no intermediate memory is used but in general
choosing $rsize > 0$ leads to the computed $\tilde{L}$ being a higher quality preconditioner.
In case of breakdown, the algorithm can incorporate the use of a global shift; see
Algorithm 9.1.
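
To make the role of the intermediate matrix concrete, here is a dense pure-Python sketch of a memory-limited IC factorization along the lines described above. It is an illustration under simplifying assumptions: matrices are lists of lists, the diagonal entry is always kept in the factor, ties between entries of equal magnitude are broken by row index, and a nonpositive pivot raises an error rather than triggering a shift. The names `memory_limited_ic`, `split_residual`, and `laplacian_2d` are hypothetical.

```python
import math


def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid, natural ordering, dense lists."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.0
            if c < m - 1:
                A[i][i + 1] = -1.0
            if r > 0:
                A[i][i - m] = -1.0
            if r < m - 1:
                A[i][i + m] = -1.0
    return A


def memory_limited_ic(A, lsize, rsize):
    """Return (L, R): incomplete Cholesky factor L and intermediate matrix R."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Column j of the current Schur complement (rows j, ..., n-1).
        w = {i: A[i][j] for i in range(j, n)}
        for k in range(j):
            for i in range(j, n):
                w[i] -= (L[i][k] * L[j][k] + R[i][k] * L[j][k]
                         + L[i][k] * R[j][k])
        if w[j] <= 0.0:
            raise ArithmeticError("breakdown: nonpositive pivot in column %d" % j)
        # Off-diagonal candidates, largest absolute value first.
        nz_a = sum(1 for i in range(j + 1, n) if A[i][j] != 0.0)
        cand = sorted((i for i in range(j + 1, n) if w[i] != 0.0),
                      key=lambda i: -abs(w[i]))
        keep_l = cand[:lsize + nz_a]
        keep_r = cand[lsize + nz_a:lsize + nz_a + rsize]
        L[j][j] = math.sqrt(w[j])
        for i in keep_l:
            L[i][j] = w[i] / L[j][j]
        for i in keep_r:
            R[i][j] = w[i] / L[j][j]   # kept only as intermediate memory
    return L, R


def split_residual(A, L, R):
    """Largest entry of |A - (L L^T + L R^T + R L^T)|."""
    n = len(A)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = sum(L[i][k] * L[j][k] + L[i][k] * R[j][k] + R[i][k] * L[j][k]
                    for k in range(n))
            worst = max(worst, abs(A[i][j] - s))
    return worst
```

When `rsize` is large enough that nothing is dropped, the residual returned by `split_residual` vanishes up to rounding, which reflects the decomposition $A = (\tilde{L} + \tilde{R})(\tilde{L} + \tilde{R})^T - \tilde{R}\tilde{R}^T$. When `lsize` is large enough, the intermediate matrix is empty and the factor coincides with the complete Cholesky factor.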


10.7 Fixed-Point Iterations for Computing ILU 
Factorizations 


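The fixed-point ILU algorithm is fundamentally different from Gaussian
elimination-based approaches. Given the target sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$, the
goal is to iteratively generate incomplete factors fulfilling the ILU property


$$(\tilde{L}\tilde{U})_{ij} = a_{ij}, \qquad (i, j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}$$


(see Theorem 10.1). The idea is appealing because the entries of $\tilde{L}$ and $\tilde{U}$ can be
computed iteratively in parallel using the constraints


$$\sum_{\substack{k = 1 \\ (i,k),\,(k,j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}}}^{\min(i,j)} \tilde{l}_{ik}\tilde{u}_{kj} = a_{ij}, \qquad (i, j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\},$$


and the normalization $\tilde{l}_{ii} = 1$. Using the relations

$$\tilde{l}_{ij} = \Bigl( a_{ij} - \sum_{k=1}^{j-1} \tilde{l}_{ik}\tilde{u}_{kj} \Bigr) / \tilde{u}_{jj}, \qquad i > j, \tag{10.2}$$

$$\tilde{u}_{ij} = a_{ij} - \sum_{k=1}^{i-1} \tilde{l}_{ik}\tilde{u}_{kj}, \qquad i \le j, \tag{10.3}$$


the approach can be formulated as a fixed-point iteration method of the form $w^{k+1} =
f(w^{k})$, $k = 0, 1, \ldots$, where $w$ is a vector containing the unknowns $\tilde{l}_{ij}$ and $\tilde{u}_{ij}$. Each
fixed-point iteration is called a sweep. Algorithm 10.8 outlines the method.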

An important question is how to choose initial values for the factor entries. 
In some applications, a natural initial guess is available. For example, in time- 
dependent problems, the L and U computed in the previous time step may provide 
appropriate initial guesses for the current time step. In other cases, a possible 
strategy is to symmetrically scale A to have a unit diagonal and then take the initial L 

ALGORITHM 10.8 Fixed-point ILU factorization
Input: Matrix $A$, the target sparsity pattern $\mathcal{S}\{\tilde{L} + \tilde{U}\}$, and initial incomplete factors
$\tilde{L}$ and $\tilde{U}$.
Output: Updated incomplete factors.

for $(i, j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}$ do
    Set $\tilde{l}_{ij}$ and $\tilde{u}_{ij}$ to the given initial values
end for
for sweep = 1, 2, ... do
    for $(i, j) \in \mathcal{S}\{\tilde{L} + \tilde{U}\}$ do
        if i > j then
            Compute $\tilde{l}_{ij}$ using (10.2)
        else
            Compute $\tilde{u}_{ij}$ using (10.3)
        end if
    end for
end for


and U to be the lower and upper parts of the scaled matrix, respectively. In practice, 
a few sweeps may be sufficient to generate preconditioners that are competitive in 
terms of quality to those generated via classical incomplete Gaussian elimination 
algorithms. 
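
A sweep-based iteration of this kind can be sketched as follows. The sketch is illustrative only: it uses dense lists of lists, a pattern given as a set of 0-based index pairs that includes the diagonal, and a Jacobi-style update in which every entry of a sweep is computed from the values of the previous sweep (the choice that allows all entries to be updated concurrently). The initial guess is simply the lower and upper parts of $A$ without any scaling, which is an assumption made for brevity. The names `fixed_point_ilu`, `pattern_residual`, and `laplacian_2d` are hypothetical.

```python
def laplacian_2d(m):
    """5-point Laplacian on an m-by-m grid, natural ordering, dense lists."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.0
            if c < m - 1:
                A[i][i + 1] = -1.0
            if r > 0:
                A[i][i - m] = -1.0
            if r < m - 1:
                A[i][i + m] = -1.0
    return A


def fixed_point_ilu(A, pattern, sweeps):
    """Run `sweeps` Jacobi-style sweeps of the relations (10.2) and (10.3)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for (i, j) in pattern:           # initial guess taken from A itself
        if i > j:
            L[i][j] = A[i][j]
        else:
            U[i][j] = A[i][j]
    for i in range(n):
        L[i][i] = 1.0                # normalization l_ii = 1
    for _ in range(sweeps):
        Lnew = [row[:] for row in L]
        Unew = [row[:] for row in U]
        for (i, j) in pattern:       # each entry uses only the previous sweep
            s = sum(L[i][k] * U[k][j] for k in range(min(i, j)))
            if i > j:
                Lnew[i][j] = (A[i][j] - s) / U[j][j]
            else:
                Unew[i][j] = A[i][j] - s
        L, U = Lnew, Unew
    return L, U


def pattern_residual(A, L, U, pattern):
    """Largest violation of (L U)_ij = a_ij over the positions in the pattern."""
    n = len(A)
    return max(abs(sum(L[i][k] * U[k][j] for k in range(n)) - A[i][j])
               for (i, j) in pattern)
```

On a small grid Laplacian with the pattern of $A$, a single sweep leaves a visible residual on the pattern, whereas after a few tens of sweeps the ILU property holds to rounding accuracy. In practice far fewer sweeps than are needed for full convergence may already give a useful preconditioner.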

The following features differentiate the fixed-point ILU algorithm from classical 
methods and make it attractive for parallel computations on modern architectures. 


• The algorithm is fine-grained, allowing for scaling to a very large number of
processors, limited only by the number of nonzero entries in the target sparsity
patterns.

• Preordering $A$ is not needed to enhance parallelism, and thus orderings that
improve the accuracy of the incomplete factorization can be used.

• The algorithm can utilize an initial guess for the ILU factorization.


To enhance the preconditioner quality, it is possible to interleave employing
Algorithm 10.8 with a strategy that dynamically adapts $\mathcal{S}\{\tilde{L} + \tilde{U}\}$ to the problem
characteristics. In an iterative process based on highly parallel building blocks,
this allows threshold-based ILU factorizations to be computed on parallel shared-memory
architectures and enables the efficient use of streaming-based architectures
such as GPUs.


10.8 Ordering in Incomplete Factorizations


Ordering algorithms designed for sparse direct solvers (see Chapter 8) can have 
a positive effect on the robustness and performance of preconditioned Krylov subspace
methods. However, the best choice of ordering for an incomplete factorization
preconditioner may not be the same as for a complete factorization, and although the 
effects of orderings and how much fill-in is allowed have been widely demonstrated, 
they are not yet fully understood. 

When the natural (lexicographic) ordering is used, the incomplete triangular 
factors resulting from a no-fill ILU factorization can be highly ill-conditioned, even 
if the matrix A is well-conditioned. Allowing more fill-in in the factors, for example, 
using ILU(1) instead of ILU(0), may solve the problem, but it is not guaranteed. In 
some cases, preordering A can lead to more stable factors, and hence more effective 
preconditioners, but, again, this is not understood. 

Minimum degree orderings (Section 8.1.2) are popular for direct methods, but 
for incomplete factorizations care is needed to ensure the dropping strategy is 
compatible with the ordering. This is because the rows (and columns) of the 
permuted matrix can have significantly different numbers of entries. In this situation, using
memory-based dropping in which the maximum allowable number of filled entries 
in a row of L is the same for all rows may not be a good approach. An alternative 
strategy is to specify that the permitted fill-in is proportional to that of the complete 
factorization (which can be computed using Algorithm 4.3). 

A level set ordering that reduces the bandwidth or profile of a matrix can be 
employed (Section 8.2). For complete factorizations, the fill-in in the factors can 
be much greater than for nested dissection or minimum degree, but for incomplete
factorizations such orderings can be highly effective. In particular, using an RCM ordering
(Algorithm 8.3) is often found to lead to a higher quality preconditioner than using 
the natural ordering. RCM-based orderings are generally inexpensive to compute 
and can provide good reuse of computer caches. 
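
As a small illustration of the bandwidth reduction that an RCM ordering provides, the following sketch uses SciPy's `reverse_cuthill_mckee` routine rather than Algorithm 8.3 itself. The model problem (a 5-point Laplacian on an 8 x 8 grid that is deliberately scrambled by a random permutation) and the helper `bandwidth` are assumptions made for the example; no incomplete factorization is computed here.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    """Largest |i - j| over the stored entries of a sparse matrix."""
    C = sp.coo_matrix(M)
    return int(np.max(np.abs(C.row - C.col)))

# 5-point Laplacian on an m x m grid, scrambled by a random symmetric
# permutation so that the starting ordering has no useful structure.
m = 8
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = sp.csr_matrix(sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m)))
rng = np.random.default_rng(0)
p = rng.permutation(m * m)
A_scrambled = A[p, :][:, p]

# RCM permutation of the scrambled matrix and the symmetrically permuted matrix.
perm = reverse_cuthill_mckee(A_scrambled, symmetric_mode=True)
A_rcm = A_scrambled[perm, :][:, perm]

print(bandwidth(A_scrambled), bandwidth(A_rcm))
```

The permuted matrix `A_rcm` would then be passed to whichever incomplete factorization is being used.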

Global orderings based on a divide-and-conquer approach and, in particular, 
nested dissection (Section 8.4) are important for complete factorizations. But such 
orderings cut local connections within the graph of A and, when used with incomplete
factorizations, can lead to poor quality preconditioners. A global ordering
that specifically targets incomplete factorizations is a red-black (or checkerboard)
ordering. Consider the graph G(A) of an SPD matrix A that arises from a simple 
5-point discretization of a PDE on a regular two-dimensional grid and colour its 
vertices using two colours so that no vertices of the same colour are incident to the 
same edge (see Figure 10.5). Because no red vertex is adjacent to any other red 
vertex, the red vertices are an independent set; similarly, the black vertices are an 
independent set. The red vertices can be processed in any order, provided they are 
all processed before any of the black vertices. This can make red-black orderings
convenient for parallel implementations and is the main reason that they are often 
employed with stationary iterative methods. 
Figure 10.5 A model problem to illustrate a red-black ordering. The grid-based graph G(A) with
coloured vertices is given together with the matrix A (left) and the symmetrically permuted matrix
using the red-black ordering (right).
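
The effect of a red-black ordering on the 5-point model problem can be checked numerically. In the sketch below, the grid size, the helper `grid_laplacian`, and the convention that vertex (r, c) is red when r + c is even are choices made for the example. Because each colour class is an independent set, both diagonal blocks of the permuted matrix are diagonal.

```python
import numpy as np

def grid_laplacian(m):
    """Dense 5-point Laplacian on an m x m grid with lexicographic ordering."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

m = 4
A = grid_laplacian(m)

# Vertex (r, c) has index r * m + c; colour it red when r + c is even.
colour = np.add.outer(np.arange(m), np.arange(m)).ravel() % 2
red = np.flatnonzero(colour == 0)
black = np.flatnonzero(colour == 1)

# Order all red vertices first, then all black vertices.
perm = np.concatenate([red, black])
A_rb = A[np.ix_(perm, perm)]

# Extract the red-red and black-black diagonal blocks.
nr = len(red)
A_rr, A_bb = A_rb[:nr, :nr], A_rb[nr:, nr:]
offdiag_rr = np.count_nonzero(A_rr - np.diag(np.diag(A_rr)))
offdiag_bb = np.count_nonzero(A_bb - np.diag(np.diag(A_bb)))
print(offdiag_rr, offdiag_bb)
```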


A bipartite graph is an undirected graph whose vertices can be partitioned into 
two disjoint sets such that each set is an independent set (Section 6.3.1). It follows 
that the red-black ordering exists if and only if G(A) is bipartite. The ordering is
often generalized as follows. Start by finding a set of mutually non-adjacent vertices 
(that is, an independent set) and flag them as red vertices. After the elimination of the 
variables corresponding to the red vertices and employing a sparsification strategy, 
a Schur complement matrix is obtained. Proceed by finding a set of mutually non- 
adjacent vertices in this matrix, flag them as red vertices and continue recursively. 
This approach can lead to a significant decrease in the condition number of the 
preconditioned matrix. Another generalization for arbitrary graphs is to employ 
more colours (multicolouring). Again, the colouring can be exploited in parallel 
computations. For efficiency, load balancing of the coloured vertices needs to be 
considered. Because reordering the vertices can affect the convergence rate of an 
iterative solver, the potential gain in parallel performance at each iteration may be 
offset by a slower convergence rate. 


10.9 Exploiting Block Structure 


Blocking methods for complete factorizations can be adapted to incomplete factor- 
izations. The aim is to speed up the computation of the factors and to obtain more 
effective preconditioners. In a block factorization, scalar operations of the form 


$$l_{ik} = a_{ik} / u_{kk}$$

are replaced by matrix operations

$$L_{ib,kb} = A_{ib,kb} U_{kb,kb}^{-1},$$


and scalar multiplications of entries of the factors are replaced by matrix-matrix
products. When dropping entries, instead of considering the absolute values, simple 
norms of the block entries (such as the one norm, max norm, or Frobenius norm) 
are used. 

An incomplete factorization can start with the supernodal structure of the 
complete factors. If dropping is applied to individual columns, this structure is 
generally lost. To try and retain it, the dropping strategy can be modified either 
to drop the set of nonzeros of a row in the current supernode or to keep it. To 
obtain sufficiently sparse incomplete factors, it may be necessary to subdivide each 
supernode, allowing greater flexibility on how many rows are dropped. It is also 
possible to relax blocking operations in such a way that the supernodes are not 
exact but are allowed to incur some fill-in. 


10.10 Notes and References 


Sparsity structure was the main ingredient of the first algebraic preconditioners that 
were developed in the late 1950s. The nonzero structure represented the stencils 
resulting from the discretization of PDEs on structured grids. The earliest contribu- 
tion is Buleev (1959), and this was later generalized to three-dimensional problems. 
An independent derivation and its interpretation as an incomplete factorization for a 
sparse matrix coming from a simple 5-point stencil is given in Varga (1960); other 
early work is by Baker & Oliphant (1960). For an overview of early contributions 
and the motivations behind incomplete factorizations, see Il'in (1992); we also refer
to the survey of Chan & van der Vorst (1997). 

Important breakthroughs in the use of preconditioning using incomplete factor- 
izations for practical problems came in two key papers. The first by Meijerink & 
van der Vorst (1977) recognized the importance of preconditioning for the conjugate 
gradient method. In the second, Kershaw (1978) proposed locally replacing pivots 
by a small positive number to prevent breakdown of the factorization. This paved 
the way for incomplete factorizations in which dropping is based solely on the size 
of the computed entries and which were introduced even earlier by Tuff & Jennings 
(1973). 

The Crout incomplete LU factorization outlined in Algorithm 10.1 was imple- 
mented in a successful code for symmetric problems by Lin & Moré (1999), 
building on earlier ideas of Jones & Plassmann (1995) and Eisenstat et al. (1982) 
(see also Li et al., 2003 for later contributions to this approach). Algorithm 10.2 
with a sparsification strategy that uses both a drop tolerance and a limit on the
number of entries in each column of the incomplete factors was published in Saad 
(1994a) as the dual threshold ILUT method. For general nonsymmetric matrices, 
ILUT has proved very popular and has been developed further (see, for example, 
MacLachlan et al., 2012). But because it is based on the row factorization, it 
ignores symmetry in A and, if A is symmetric, the computed sparsity patterns of 
$L$ and $U^T$ are normally different. In this case, a Crout incomplete factorization
may be preferable. The hierarchy of sparsity structures based on the concept of 
levels is introduced in Watts-III (1981). The initial work has since been significantly 
improved, notably for parallel implementations by Hysom & Pothen (2002). The 
Euclid library is a scalable implementation of a parallel level-based ILU algorithm 
that is available as part of the hypre library of linear solvers (see Falgout et al., 
2006, 2021). Scalable means that the incomplete factorization and triangular solve 
timings remain nearly constant when the problem size n is scaled in proportion to 
the number of processors. Another parallel level-based ILU preconditioner that uses 
an adaptive block implementation is proposed in Hénon et al. (2008). 

The modified incomplete factorizations of Section 10.4 are described in Saad 
(2003b). A proof of Theorem 10.3 can be found in Bern et al. (2006), but it is also 
of interest to follow earlier work on asymptotic bounds for the condition number 
of matrices preconditioned by modified incomplete factorizations given in Dupont 
et al. (1968), Axelsson (1972), and Gustafsson (1978), while an elegant description 
is in Meurant (1999). 

Incomplete factorizations with dynamic compensation originally introduced by 
Ajiz & Jennings (1984) have been routinely employed in practice. However, 
memory-limited approaches based on relaxing the strategy of Tismenetsky (1991) 
often lead to more efficient preconditioners; see Kaporin (1998) for a row-based 
construction that has recently been used by Konshin et al. (2017, 2019) to solve 
challenging practical problems. Scott & Tůma (2014b) present a Crout construction 
of a sophisticated memory-limited incomplete factorization and provide a robust 
implementation for SPD systems as the package HSL_MI28 within the HSL
mathematical software library (Scott & Tůma, 2014a); a variant for symmetric
saddle point systems is also included in HSL. 

Using fixed-point iterations for the parallel computation of incomplete factor- 
izations is a relatively new idea that was proposed and analysed by Chow & Patel 
(2015). Interleaving a fixed-point iteration with a procedure that adjusts the sparsity 
pattern is proposed by Anzt et al. (2018). Other attempts to compute and use ILU 
preconditioners in parallel that build on the software package ILUPACK (Bollhöfer
et al., 2012) are presented in Aliaga et al. (2016, 2019). A different approach to 
parallelize incomplete factorizations by relaxing supernodes is given by Gupta & 
George (2010). 

Significant attention has been devoted to using orderings of A to try and improve 
the quality of incomplete factorization preconditioners. An early and often quoted 
comparison of reorderings for SPD problems is by Duff & Meurant (1989). For 
more general matrices, see Benzi et al. (1999), Oliker et al. (2002), or Osei-Kuffuor 
et al. (2015). Saad (1996a) and Saad & Zhang (1999) generalize red-black orderings
and consider blocks and/or more colours; also of interest are the papers of Saad &
Suchomel (2002), Li et al. (2003), and Carpentieri et al. (2014).
Chapter 11
Sparse Approximate Inverse Preconditioners


While it is recognized that preconditioning the system often 

improves the convergence of a particular method, this is not 
always so. In particular, a successful preconditioner for one 
class of problems may prove ineffective on another class. — 

Gould & Scott (1998). 


There is, of course, no such concept as a best preconditioner ... 
However, every practitioner knows when they have a good 
preconditioner which enables feasible computation and solution 
of problems. In this sense, preconditioning will always be an art 
rather than a science. — Wathen (2015). 


Consider a preconditioner $M$ based on an incomplete LU (or Cholesky) factorization
of a matrix $A$. $M^{-1}$, which represents an approximation of $A^{-1}$, is applied by
performing forward and back substitution steps; this can present a computational
bottleneck. An alternative strategy is to directly approximate $A^{-1}$ by explicitly
computing $M^{-1}$. Preconditioners of this kind are called sparse approximate inverse
preconditioners. They constitute an important class of algebraic preconditioners 
that are complementary to the approaches discussed in the previous chapter. They 
can be attractive because when used with an iterative solver, they can require 
fewer iterations than standard incomplete factorization preconditioners that contain 
a similar number of entries while offering significantly greater potential for parallel 
computations. 

From Theorem 7.3, the sparsity pattern of the inverse of an irreducible matrix A 
is dense, even when A is sparse. Therefore, if A is large, the exact computation of its 
inverse is not an option, and aggressive dropping is needed to obtain a sufficiently 
sparse approximation to $A^{-1}$ that can be used as a preconditioner. Fortunately,
for a wide class of problems of practical interest, many of the entries of $A^{-1}$ are
small in absolute value, so that approximating the inverse with a sparse $M^{-1}$ may
be feasible, although capturing the large (important) values of $A^{-1}$ is a nontrivial
task. Importantly, the computed $M^{-1}$ can have nonzeros at positions that cannot
be obtained by either a complete or an incomplete factorization, and this can be
beneficial. Furthermore, although $A^{-1}$ is fully dense, the following result shows
this is not the case for the factors of factorized inverses.


Theorem 11.1 (Bridson & Tang 1999; Benzi & Tůma 2000) Assume the matrix
$A$ is SPD, and let $L$ be its Cholesky factor. Then $S\{L^{-1}\}$ is the union of all entries
$(i, j)$ such that $i$ is an ancestor of $j$ in the elimination tree $\mathcal{T}(A)$.


A consequence of this result is that $L^{-1}$ need not be fully dense. Considering this
implication algorithmically, if $A$ is SPD, it may be advantageous to preorder $A$ to
limit the number of ancestors that the vertices in $\mathcal{T}(A)$ have. For example, nested
dissection may be applied to $S\{A\}$ (Section 8.4). If $S\{A\}$ is nonsymmetric, then it
may be possible to reduce fill-in in the factors of $A^{-1}$ by applying nested dissection
to $S\{A + A^T\}$.
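
The influence of the ordering on $S\{L^{-1}\}$ can be seen on a small arrow matrix. The sketch below is an illustration under stated assumptions: the matrix, the tolerance, and the helper `inverse_factor_nnz` are chosen for the example, and dense linear algebra is used for brevity. With the dense row and column ordered first, the elimination tree is a chain, so every later vertex is an ancestor of every earlier one. Ordered last, the tree is a star with the last vertex as the only ancestor.

```python
import numpy as np

def inverse_factor_nnz(A, tol=1e-12):
    """Number of entries of inv(L) larger than tol in magnitude, where A = L L^T."""
    L = np.linalg.cholesky(A)
    return int(np.count_nonzero(np.abs(np.linalg.inv(L)) > tol))

n = 6
# Arrow matrix with the dense row and column ordered first (SPD by
# strict diagonal dominance).
A_first = n * np.eye(n)
A_first[0, 1:] = 1.0
A_first[1:, 0] = 1.0

# The same matrix with the ordering reversed: dense row and column last.
p = np.arange(n)[::-1]
A_last = A_first[np.ix_(p, p)]

nnz_first = inverse_factor_nnz(A_first)
nnz_last = inverse_factor_nnz(A_last)
print(nnz_first, nnz_last)
```

Counting the diagonal as well, Theorem 11.1 predicts a full lower triangle, $n(n+1)/2$ entries, for the first ordering, and only the diagonal plus the last row, $2n - 1$ entries, for the second.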


11.1 Basic Approaches 


An obvious way to obtain an approximate inverse of A in factorized form is to 
compute an incomplete LU factorization of A and then perform an approximate 
inversion of the incomplete factors. For example, if incomplete factors $\tilde{L}$ and $\tilde{U}$
are available, approximate inverse factors can be found by solving the $2n$ triangular
linear systems

$$\tilde{L} x_i = e_i, \qquad \tilde{U} y_i = e_i, \qquad 1 \le i \le n,$$

where $e_i$ is the $i$-th column of the identity matrix. These systems can all be solved
independently, and hence, there is the potential for significant parallelism. To reduce 
costs and to preserve sparsity in the approximate inverse factors, they may not need 
to be solved accurately. A disadvantage is that the computation of the preconditioner 
involves two levels of incompleteness, and because information from the incomplete 
factorization of A is passed into the second step, the loss of information can be 
excessive. 
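
A minimal sketch of this two-stage idea follows. It assumes a dense triangular factor and simply drops small entries from each exactly computed column; a practical code would instead solve the sparse systems inexactly. The function name `approximate_inverse_factor` and the parameter `drop_tol` are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_triangular

def approximate_inverse_factor(T, lower, drop_tol=0.0):
    """Columns of inv(T) for a triangular matrix T, with small entries dropped."""
    n = T.shape[0]
    X = np.zeros((n, n))
    for i in range(n):                     # the n solves are independent
        e = np.zeros(n)
        e[i] = 1.0
        x = solve_triangular(T, e, lower=lower)
        x[np.abs(x) < drop_tol] = 0.0      # sparsify the computed column
        X[:, i] = x
    return X

# Example with a small lower triangular factor.
L_tilde = np.array([[2.0, 0.0, 0.0],
                    [1.0, 3.0, 0.0],
                    [0.001, 1.0, 4.0]])
X_exact = approximate_inverse_factor(L_tilde, lower=True)
X_sparse = approximate_inverse_factor(L_tilde, lower=True, drop_tol=0.05)
```

The same routine applied with `lower=False` to the upper triangular factor gives the second approximate inverse factor.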

Another straightforward approach is based on bordering. Let $A_j$ denote the
principal leading submatrix of $A$ of order $j$ ($A_j = A_{1:j,1:j}$), and assume that its
inverse factorization

$$A_j^{-1} = W_j D_j^{-1} Z_j^T$$

is known. Here $W_j$ and $Z_j$ are unit upper triangular matrices, and $D_j$ is a diagonal
matrix. Consider the following scheme:

$$\begin{pmatrix} Z_j^T & 0 \\ z_{j+1}^T & 1 \end{pmatrix}
\begin{pmatrix} A_j & A_{1:j,j+1} \\ A_{j+1,1:j} & a_{j+1,j+1} \end{pmatrix}
\begin{pmatrix} W_j & w_{j+1} \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} D_j & 0 \\ 0 & d_{j+1,j+1} \end{pmatrix},$$

where for $1 \le j < n$

$$w_{j+1} = -W_j D_j^{-1} Z_j^T A_{1:j,j+1}, \qquad
z_{j+1} = -Z_j D_j^{-1} W_j^T A_{j+1,1:j}^T,$$

$$d_{j+1,j+1} = a_{j+1,j+1} + z_{j+1}^T A_j w_{j+1} + A_{j+1,1:j} w_{j+1} + z_{j+1}^T A_{1:j,j+1}.$$

Starting from $j = 1$, this suggests a procedure for computing the inverse factors of
$A$. Sparsity can be preserved by dropping some entries from the vectors $w_{j+1}$ and
$z_{j+1}$ once they have been computed. Sparsity and the quality of the preconditioner
can be influenced by preordering A. 

If A is symmetric, W = Z and the required work is halved. Furthermore, if A is 
SPD, then it can be shown that, in exact arithmetic, $d_{jj} > 0$ for all $j$ and the process
does not break down. In the general case, diagonal modifications may be required, 
which can limit the effectiveness of the resulting preconditioner. 

Observe that the computations of Z and W are tightly coupled, restricting the 
potential to exploit parallelism. At each step $j$, besides a matrix-vector product
with $A_j$, four sparse matrix-vector products involving $W_j$, $Z_j$ and their transposes
are needed; these account for most of the work. The implementation is simplified if 
access to the triangular factors is available by columns as well as by rows. 
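
The bordering recurrences can be written down directly. The sketch below is a dense illustration only: the function name `bordering_inverse_factors` and the simple absolute-value dropping rule controlled by `drop_tol` are assumptions, and no diagonal modification is attempted if a pivot vanishes. With no dropping, it should reproduce $Z^T A W = D$ exactly up to rounding.

```python
import numpy as np

def bordering_inverse_factors(A, drop_tol=0.0):
    """Inverse factors W, Z (unit upper triangular) and diagonal d with Z^T A W = diag(d)."""
    n = A.shape[0]
    W = np.eye(n)
    Z = np.eye(n)
    d = np.zeros(n)
    d[0] = A[0, 0]
    for j in range(1, n):                  # border the leading block of order j
        Aj, col, row = A[:j, :j], A[:j, j], A[j, :j]
        Wj, Zj = W[:j, :j], Z[:j, :j]
        # New columns of W and Z from the recurrences.
        w = -Wj @ ((Zj.T @ col) / d[:j])
        z = -Zj @ ((Wj.T @ row) / d[:j])
        # Optional sparsification of the new columns.
        w[np.abs(w) < drop_tol] = 0.0
        z[np.abs(z) < drop_tol] = 0.0
        # New diagonal entry, using all four terms of the bordered product.
        d[j] = A[j, j] + z @ Aj @ w + row @ w + z @ col
        W[:j, j] = w
        Z[:j, j] = z
    return W, Z, d
```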


11.2 Approximate Inverses Based on Frobenius Norm 
Minimization 


It is clear from the above discussion that alternative techniques for constructing 
sparse approximate inverse preconditioners are needed. We start by looking at 
schemes based on Frobenius norm minimization. Historically, these were the first 
to be proposed and offer the greatest potential for parallelism because both the 
construction of the preconditioner and its subsequent application can be performed 
in parallel. 


11.2.1 SPAI Preconditioner 


To describe the sparse approximate inverse (SPAI) preconditioner, it is convenient 
to use the notation $K = M^{-1}$. The basic idea is to compute $K \approx A^{-1}$ with its
columns denoted by $k_j$ as the solution of the problem of minimizing

$$\|I - AM^{-1}\|_F^2 = \|I - AK\|_F^2 = \sum_{j=1}^{n} \|e_j - A k_j\|_2^2, \qquad (11.1)$$

over all $K$ with pattern $S$. This produces a right approximate inverse. A left
approximate inverse can be computed by solving a minimization problem for
$\|I - KA\|_F = \|I - A^T K^T\|_F$. This amounts to computing a right approximate inverse
for $A^T$ and taking the transpose of the resulting matrix. For nonsymmetric matrices,
the distinction between left and right approximate inverses can be important. Indeed,
there are situations where it is difficult to compute a good right approximate inverse
but easy to find a good left approximate inverse (or vice versa). In the following
discussion, we assume that a right approximate inverse is being computed.

The Frobenius norm is generally used because the minimization problem then 
reduces to least squares problems for the columns of K that can be computed 
independently and, if required, in parallel. Further, these least squares problems are 
all of small dimension when $S$ is chosen to ensure $K$ is sparse. Let
$\mathcal{J} = \{i \mid k_j(i) \neq 0\}$ be the set of indices of the nonzero entries in column $k_j$. The set of indices of
rows of $A$ that can affect a product with column $k_j$ is $\mathcal{I} = \{m \mid A_{m,\mathcal{J}} \neq 0\}$. Let $|\mathcal{I}|$
and $|\mathcal{J}|$ denote the number of entries in $\mathcal{I}$ and $\mathcal{J}$, respectively, and let $\hat{e}_j = e_j(\mathcal{I})$ be
the vector of length $|\mathcal{I}|$ that is obtained by taking the entries of $e_j$ with row indices
belonging to $\mathcal{I}$. To solve (11.1) for $k_j$, construct the $|\mathcal{I}| \times |\mathcal{J}|$ matrix $\hat{A} = A_{\mathcal{I},\mathcal{J}}$
and solve the small unconstrained least squares problem

$$\min_{\hat{k}_j} \|\hat{e}_j - \hat{A} \hat{k}_j\|_2^2. \qquad (11.2)$$

This can be done using a dense QR factorization of $\hat{A}$. Extending $\hat{k}_j$ to have length
$n$ by setting entries that are not in $\mathcal{J}$ to zero gives $k_j$.
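
For a fixed pattern, the computation of one column reduces to a few lines. The following sketch assumes a dense NumPy array for $A$ and 0-based indices; the function name `spai_column` is hypothetical. It forms the index set of rows, extracts the small matrix, and solves (11.2) via a reduced QR factorization.

```python
import numpy as np

def spai_column(A, j, J):
    """Column j of a right approximate inverse with prescribed pattern J."""
    n = A.shape[0]
    J = np.asarray(J)
    # I: rows of A with at least one nonzero in the columns indexed by J.
    I = np.flatnonzero(np.any(A[:, J] != 0, axis=1))
    A_hat = A[np.ix_(I, J)]
    e_hat = (I == j).astype(float)         # entries of e_j with row indices in I
    # Dense (reduced) QR factorization of the small matrix.
    Q, R = np.linalg.qr(A_hat)
    k_hat = np.linalg.solve(R, Q.T @ e_hat)
    # Extend to length n by setting entries outside J to zero.
    k = np.zeros(n)
    k[J] = k_hat
    return k
```

Since each column depends only on $A$ and its own pattern, calls for different `j` can be run independently.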

A straightforward way to construct S that does not depend on a sophisticated 
initial choice (but could, for example, be the identity or be equal to S{A}) proceeds 
as follows. Starting with a chosen column sparsity pattern $\mathcal{J}$ for $k_j$, construct $\hat{A}$,
solve (11.2) for $\hat{k}_j$, set $k_j(\mathcal{J}) = \hat{k}_j$, and define the residual vector

$$r_j = e_j - A_{1:n,\mathcal{J}} \hat{k}_j.$$

If $\|r_j\|_2 \neq 0$, then $k_j$ is not equal to the $j$-th column of $A^{-1}$, and a better
approximation can be derived by augmenting $\mathcal{J}$. To do this, let $\mathcal{L} = \{l \mid r_j(l) \neq 0\}$
and define

$$\tilde{\mathcal{J}} = \{i \mid A_{\mathcal{L},i} \neq 0\} \setminus \mathcal{J}. \qquad (11.3)$$

These are candidate indices that can be added to $\mathcal{J}$, but as there may be many of
them, they need to be chosen to most effectively reduce $\|r_j\|_2$. A possible heuristic
is to solve for each $i \in \tilde{\mathcal{J}}$ the minimization problem

$$\min_{\mu_i} \|r_j - \mu_i A e_i\|_2^2.$$

This has the solution $\mu_i = r_j^T A e_i / \|A e_i\|_2^2$ with residual $\|r_j\|_2^2 - (r_j^T A e_i)^2 / \|A e_i\|_2^2$.
Indices $i \in \tilde{\mathcal{J}}$ for which this is small are appended to $\mathcal{J}$. The process can be repeated
until either the required accuracy is attained or the maximum number of allowed
entries in $\mathcal{J}$ is reached.

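Solving the unconstrained least squares problem (11.2) after extending $\hat{A}$ to
$A_{\mathcal{I} \cup \mathcal{I}', \mathcal{J} \cup \mathcal{J}'}$ is typically performed using updating. Assume the QR factorization
of $\hat{A}$ is

$$\hat{A} = Q \begin{pmatrix} R \\ 0 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \begin{pmatrix} R \\ 0 \end{pmatrix},$$

where $Q_1$ is $|\mathcal{I}| \times |\mathcal{J}|$. Here $Q$ is an orthogonal matrix and $R$ is an upper triangular
matrix. The QR factorization of the extended matrix is

$$A_{\mathcal{I} \cup \mathcal{I}', \mathcal{J} \cup \mathcal{J}'}
= \begin{pmatrix} \hat{A} & A_{\mathcal{I},\mathcal{J}'} \\ 0 & A_{\mathcal{I}',\mathcal{J}'} \end{pmatrix}
= \begin{pmatrix} Q & \\ & I \end{pmatrix}
\begin{pmatrix} R & Q_1^T A_{\mathcal{I},\mathcal{J}'} \\ 0 & Q_2^T A_{\mathcal{I},\mathcal{J}'} \\ 0 & A_{\mathcal{I}',\mathcal{J}'} \end{pmatrix}
= \begin{pmatrix} Q & \\ & I \end{pmatrix}
\begin{pmatrix} I & \\ & Q' \end{pmatrix}
\begin{pmatrix} R & Q_1^T A_{\mathcal{I},\mathcal{J}'} \\ 0 & R' \\ 0 & 0 \end{pmatrix},$$

where $Q'$ and $R'$ are from the QR factorization of the $(|\mathcal{I}'| + |\mathcal{I}| - |\mathcal{J}|) \times |\mathcal{J}'|$
matrix

$$\begin{pmatrix} Q_2^T A_{\mathcal{I},\mathcal{J}'} \\ A_{\mathcal{I}',\mathcal{J}'} \end{pmatrix}.$$

Factorizing this matrix and updating the trailing QR factorization to get the new $\hat{k}_j$
is much more efficient than computing the QR factorization of the extended matrix
from scratch.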

Construction of the SPAI preconditioner is summarized in Algorithm 11.1. The 
maximum number of entries $nz_j$ that is permitted in $k_j$ must be at least as large
as the number of entries in the initial sparsity pattern $\mathcal{J}_j$. Updating can be used to
compute a new $\hat{k}_j$ for each pass through the while loop; the number of passes is
typically small (for example, if a good initial sparsity pattern is available, a single 
pass may be sufficient). 
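
The adaptive construction of a single column can be sketched as follows. This is a simplified illustration, not the algorithm as implemented in any library: the function name `spai_column_adaptive` is hypothetical, $A$ is a dense array with 0-based indices, one index is added per pass, and the least squares problem is re-solved from scratch with `numpy.linalg.lstsq` instead of updating a QR factorization.

```python
import numpy as np

def spai_column_adaptive(A, j, J0, max_nz, eta=1e-8):
    """Column j of a right approximate inverse with a dynamically grown pattern."""
    n = A.shape[0]
    J = sorted(J0)
    e = np.zeros(n)
    e[j] = 1.0
    while True:
        # Least squares solve on the current pattern. Rows outside I are zero
        # in A[:, J], so keeping all rows does not change the minimizer.
        k_hat = np.linalg.lstsq(A[:, J], e, rcond=None)[0]
        r = e - A[:, J] @ k_hat
        if np.linalg.norm(r) <= eta or len(J) >= max_nz:
            break
        # Candidate set (11.3): columns with a nonzero in a row where r is nonzero.
        rows = np.flatnonzero(r != 0)
        cols = np.flatnonzero(np.any(A[rows, :] != 0, axis=0))
        candidates = [int(i) for i in cols if i not in J]
        if not candidates:
            break

        def reduced_norm(i):
            # Squared residual after the one-dimensional minimization along A e_i.
            Aei = A[:, i]
            return r @ r - (r @ Aei) ** 2 / (Aei @ Aei)

        J = sorted(J + [min(candidates, key=reduced_norm)])
    k = np.zeros(n)
    k[J] = k_hat
    return k, J
```

For a 5 x 5 tridiagonal matrix with diagonal 10, superdiagonal -2 and subdiagonal -1 (values chosen for the example), starting from the first two indices with `max_nz=3` selects the third index as the one that most reduces the residual.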

The example in Figure 11.1 illustrates Algorithm 11.1. Starting with a tridiagonal
matrix, it considers the computation of the first column $k_1$ of the inverse matrix $K$.
The algorithm starts with $\mathcal{J} = \{1, 2\}$.

When $A$ is symmetric, there is no guarantee that the computed $K$ will be
symmetric. One possibility is to use $(K + K^T)/2$ to approximate $A^{-1}$. The SPAI
preconditioner is not sensitive to the ordering of A. This has the advantage that 
A can be partitioned and preordered in whatever way is convenient, for instance, 
ALGORITHM 11.1 SPAI preconditioner (right-looking approach)

Input: Nonsymmetric matrix $A$, a convergence tolerance $\eta > 0$, an initial sparsity
pattern $\mathcal{J}_j$ and the maximum number $nz_j$ of permitted entries for column $j$ of $K$
($1 \le j \le n$).

Output: $K \approx A^{-1}$ with columns $k_j$ ($1 \le j \le n$).

for $j = 1 : n$ do    ▷ The columns may be computed in parallel
    Set $\mathcal{J} = \mathcal{J}_j$, $\mathcal{I} = \{m \mid A_{m,\mathcal{J}} \neq 0\}$ and $\|r_j\|_2 = \infty$
    Construct $\hat{A} = A_{\mathcal{I},\mathcal{J}}$ and solve (11.2) for $\hat{k}_j$
    $r_j = e_j - A_{1:n,\mathcal{J}} \hat{k}_j$
    while $|\mathcal{J}| < nz_j$ and $\|r_j\|_2 > \eta$ do
        Construct $\tilde{\mathcal{J}}$ given by (11.3)    ▷ $\tilde{\mathcal{J}}$ is the candidate set
        Determine new indices $\mathcal{J}' \subset \tilde{\mathcal{J}}$ to add to $\mathcal{J}$
        $\mathcal{I}' = \{m \mid A_{m,\mathcal{J}'} \neq 0\} \setminus \mathcal{I}$
        $\mathcal{I} = \mathcal{I} \cup \mathcal{I}'$ and $\mathcal{J} = \mathcal{J} \cup \mathcal{J}'$    ▷ Augment the sparsity pattern
        Construct new $\hat{A} = A_{\mathcal{I},\mathcal{J}}$ and new $\hat{k}_j$    ▷ Update the QR factorization
        $r_j = e_j - A_{1:n,\mathcal{J}} \hat{k}_j$
    end while
    $k_j(\mathcal{J}) = \hat{k}_j$    ▷ Extend $\hat{k}_j$ to $k_j$ by setting entries not in $\mathcal{J}$ to zero
end for
Figure 11.1 An illustration of computing the first column of a sparse approximate inverse using
the SPAI algorithm with $nz_j = 3$. On the top line is the initial tridiagonal matrix $A$ followed by the
matrix $\hat{A}$ and the vectors $\hat{k}_1$ and $r_1$ on the first loop of Algorithm 11.1. The bottom line presents
the updated matrix $\hat{A}$ that is obtained on the second loop by adding the third row and column of
$A$ and the corresponding vectors $\hat{k}_1$ and $r_1$ and, finally, $k_1$. Here the numerical values have been
appropriately rounded.


to better suit the needs of a distributed implementation, without worrying about 
the impact on the subsequent convergence rate of the solver. The disadvantage is 
that orderings cannot be used to reduce fill-in and/or improve the quality of this 
approximate inverse. For instance, if $A^{-1}$ has no small entries, SPAI will not find
a sparse $K$, and because the inverse of a permutation of $A$ is just a permutation of
$A^{-1}$, no permutation of $A$ will change this.


11.2.2 FSAI Preconditioner: SPD Case 


We next consider a class of preconditioners based on an incomplete inverse factorization
of $A^{-1}$. The factorized sparse approximate inverse (FSAI) preconditioner
for an SPD matrix $A$ is defined as the product

$$M^{-1} = G^T G,$$

where the sparse lower triangular matrix $G$ is an approximation of the inverse of
the (complete) Cholesky factor $L$ of $A$. Theoretically, a FSAI preconditioner is
computed by choosing a lower triangular sparsity pattern $S_L$ and minimizing

$$\|I - GL\|_F^2 = \mathrm{tr}\left[(I - GL)^T (I - GL)\right], \qquad (11.4)$$

over all $G$ with sparsity pattern $S_L$. Here tr denotes the matrix trace operator (that
is, the sum of the entries on the diagonal). The computation of $G$ can be performed
without knowing $L$ explicitly. Differentiating (11.4) with respect to the entries of $G$
and setting to zero yields

$$(G L L^T)_{ij} = (GA)_{ij} = (L^T)_{ij} \quad \text{for all } (i, j) \in S_L. \qquad (11.5)$$

Because $L^T$ is an upper triangular matrix while $S_L$ is a lower triangular pattern, the
matrix equation (11.5) can be rewritten as

$$(GA)_{ij} = \begin{cases} 0, & i \neq j, \ (i, j) \in S_L, \\ l_{ii}, & i = j. \end{cases} \qquad (11.6)$$

$G$ is not available from (11.6) because $L$ is unknown. Instead, $\bar{G}$ is computed such
that

$$(\bar{G} A)_{ij} = \delta_{ij} \quad \text{for all } (i, j) \in S_L, \qquad (11.7)$$

where $\delta_{ij}$ is the Kronecker delta function ($\delta_{ij} = 1$ if $i = j$ and is equal to 0,
otherwise). The FSAI factor $G$ is then obtained by setting

$$G = D \bar{G},$$

where $D$ is a diagonal scaling matrix. An appropriate choice for $D$ is

$$D = [\mathrm{diag}(\bar{G})]^{-1/2}, \qquad (11.8)$$

so that

$$(G A G^T)_{ii} = 1, \quad 1 \le i \le n.$$

The following result shows that the FSAI preconditioner exists for any nonzero
pattern $S_L$ that includes the main diagonal of $A$.


Theorem 11.2 (Kolotilina & Yeremin 1993) Assume $A$ is SPD. If the lower
triangular sparsity pattern $S_L$ includes all diagonal positions, then $G$ exists and
is unique.


Proof Set $\mathcal{I}_i = \{j \mid (i, j) \in S_L\}$, and let $A_{\mathcal{I}_i,\mathcal{I}_i}$ denote the submatrix of order
$nz_i = |\mathcal{I}_i|$ of entries $a_{kl}$ such that $k, l \in \mathcal{I}_i$. Let $\bar{g}_i$ and $g_i$ be dense vectors containing
the nonzero coefficients in row $i$ of $\bar{G}$ and $G$, respectively. Using this notation,
solving (11.7) decouples into solving $n$ independent SPD linear systems

$$A_{\mathcal{I}_i,\mathcal{I}_i} \bar{g}_i = e_{nz_i}, \quad 1 \le i \le n,$$

where the unit vectors are of length $nz_i$. Moreover,

$$(\bar{G} A \bar{G}^T)_{ii} = \sum_{j \in \mathcal{I}_i} \delta_{ij} \bar{G}_{ij} = \bar{G}_{ii} = (A_{\mathcal{I}_i,\mathcal{I}_i}^{-1})_{nz_i,nz_i}.$$

This implies that the diagonal entries of $D$ given by (11.8) are nonzero. Consequently,
the computed rows of $G$ exist and provide a unique solution. □


The procedure for computing a FSAI preconditioner is summarized in Algo- 
rithm 11.2. The computation of each row of G can be performed independently; 
thus, the algorithm is inherently parallel, but each application of the preconditioner 
requires the solution of triangular systems. 

The following theorem states that G computed using Algorithm 11.2 is in some 
sense optimal. 


Theorem 11.3 (Kolotilina et al. 2000) Let $L$ be the Cholesky factor of the SPD
matrix $A$. Given a lower triangular sparsity pattern $S_L$ that includes all diagonal
positions, let $G$ be the FSAI preconditioner computed using Algorithm 11.2. Then
any lower triangular matrix $G_1$ with its sparsity pattern contained in $S_L$ and
$(G_1 A G_1^T)_{ii} = 1$ ($1 \le i \le n$) satisfies

$$\|I - GL\|_F \le \|I - G_1 L\|_F.$$
ALGORITHM 11.2 FSAI preconditioner

Input: SPD matrix $A$ and a lower triangular sparsity pattern $S_L$ that includes all
diagonal positions.

Output: Lower triangular matrix $G$ such that $A^{-1} \approx G^T G$.

for $i = 1 : n$ do
    Construct $\mathcal{I}_i = \{j \mid (i, j) \in S_L\}$, $A_{\mathcal{I}_i,\mathcal{I}_i}$ and set $nz_i = |\mathcal{I}_i|$
    Solve $A_{\mathcal{I}_i,\mathcal{I}_i} \bar{g}_i = e_{nz_i}$
    Scale $g_i = d_{ii} \bar{g}_i$ with $d_{ii} = (\bar{g}_{i,nz_i})^{-1/2}$    ▷ $\bar{g}_{i,nz_i}$ is the last component of $\bar{g}_i$
    Extend $g_i$ to the row $G_{i,1:i}$ by setting entries that are not in $\mathcal{I}_i$ to zero
end for
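
A dense sketch of Algorithm 11.2 is given below. The function name `fsai` and the representation of $S_L$ as a list of sorted index lists (one per row, each ending with the diagonal index) are assumptions made for the example; a practical implementation would work with sparse storage.

```python
import numpy as np

def fsai(A, pattern):
    """FSAI factor G for an SPD matrix A.

    pattern[i] is the sorted list of column indices allowed in row i of G;
    its last entry is assumed to be i (the diagonal position).
    """
    n = A.shape[0]
    G = np.zeros((n, n))
    for i in range(n):                     # the rows can be computed independently
        idx = np.asarray(pattern[i])
        e = np.zeros(len(idx))
        e[-1] = 1.0                        # last unit vector of length nz_i
        g_bar = np.linalg.solve(A[np.ix_(idx, idx)], e)
        # Scale so that (G A G^T)_{ii} = 1; g_bar[-1] > 0 because the block is SPD.
        G[i, idx] = g_bar / np.sqrt(g_bar[-1])
    return G
```

With the full lower triangular pattern, the computed $G$ coincides with the inverse of the Cholesky factor, so $G A G^T = I$; for sparser patterns only the diagonal of $G A G^T$ is guaranteed to be one.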


The performance of the FSAI preconditioner is highly dependent on the choice
of $S_L$. If entries are added to the pattern, then, as the following result shows, the
preconditioner is more accurate, but it is also more expensive.


Theorem 11.4 (Kolotilina et al. 2000) Let $L$ be the Cholesky factor of the SPD
matrix $A$. Given the lower triangular sparsity patterns $S_{L1}$ and $S_{L2}$ that include
all diagonal positions, let the corresponding FSAI preconditioners computed using
Algorithm 11.2 be $G_1$ and $G_2$, respectively. If $S_{L1} \subseteq S_{L2}$, then

$$\|I - G_2 L\|_F \le \|I - G_1 L\|_F.$$


11.2.3 FSAI Preconditioner: General Case 


The FSAI algorithm can be extended to a general matrix $A$. Two input sparsity
patterns are required: a lower triangular sparsity pattern $S_L$ and an upper triangular
sparsity pattern $S_U$, both containing the diagonal positions. First, lower and upper
triangular matrices $\bar{G}_L$ and $\bar{G}_U$ are computed such that

$$(\bar{G}_L A)_{ij} = \delta_{ij} \quad \text{for all } (i, j) \in S_L,$$
$$(A \bar{G}_U)_{ij} = \delta_{ij} \quad \text{for all } (i, j) \in S_U.$$

Then $D$ is obtained as the inverse of the diagonal of the matrix $\bar{G}_L A \bar{G}_U$, and the
final nonsymmetric FSAI factors are given by $G_L = \bar{G}_L$ and $G_U = \bar{G}_U D$. The
computation of the two approximate factors can be performed independently.

This generalization is well defined if, for example, A is nonsymmetric positive 
definite. There is also theory that extends existence to special classes of matrices, 
including M- and H-matrices. In more general cases, solutions to the reduced 
systems may not exist, and modifications (such as perturbing the diagonal entries)
are needed to circumvent breakdown. 


11.2.4 Determining a Good Sparsity Pattern 


The role of the input pattern is to preserve sparsity by filtering out entries of $A^{-1}$
that contribute little to the quality of the preconditioner. For instance, it might be
appropriate to ignore entries with a small absolute value, while retaining the largest
ones. Unfortunately, the locations of large entries in $A^{-1}$ are generally unknown,
and this makes the a priori sparsity choice difficult. A possible exception is when
$A$ is a banded SPD matrix. In this case, the entries of $A^{-1}$ are bounded in an
exponentially decaying manner along each row or column. Specifically, there exist
$0 < \rho < 1$ and a constant $c$ such that for all $i, j$

$$|(A^{-1})_{ij}| \le c \rho^{|i-j|}.$$

The scalars $\rho$ and $c$ depend on the bandwidth and the condition number of $A$. For
matrices with a large bandwidth and/or a high condition number, $c$ can be very large
and $\rho$ close to one, indicating extremely slow decay. However, if the entries of $A^{-1}$
can be shown to decay rapidly, then a banded $M^{-1}$ should be a good approximation
to $A^{-1}$. In this case, $S_L$ can be chosen to correspond to a matrix with a prescribed
bandwidth.

A common choice for a general A is $S_L + S_U$ = S{A}, motivated by the empirical
observation that entries in $A^{-1}$ that correspond to nonzero positions in A tend to
be relatively large. However, this simple choice is not robust because entries of
$A^{-1}$ that lie outside S{A} can also be large. An alternative strategy based on the
Neumann series expansion of $A^{-1}$ is to use the pattern of a small power of A,
i.e., $S\{A^2\}$ or $S\{A^3\}$. By starting from the lower and upper triangular parts of A,
this approach can be used to determine candidates $S_L$ and $S_U$. While approximate
inverses based on higher powers of A are often better than those corresponding
to A, there is no guarantee they will result in good preconditioners. Furthermore,
even small powers of A can be very dense, thus slowing down the construction
and application of the preconditioner. A possible remedy is to use the power of a
sparsified A. Alternatively, the pattern can be chosen dynamically by retaining the
largest terms in each row of the preconditioner as it is computed, which is what
the SPAI algorithm does. Another possibility is to implicitly determine $S_L + S_U$ as
follows. Starting with a simple sparsity pattern, compute the corresponding FSAI
preconditioner $G_1$. Then choose a pattern based on $G_1 A G_1^T$ and apply the FSAI
algorithm to $G_1 A G_1^T$ to obtain $G_2$. Finally, set the preconditioner to $G_2 G_1$. Despite
running the FSAI algorithm twice, this approach can be worthwhile. Unfortunately,
the choice of the best technique for generating an FSAI preconditioner and its sparsity
pattern is highly problem dependent.
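The pattern of a small power of A can be formed symbolically, without computing any numerical values. The sketch below is an illustration under stated assumptions: patterns are held as Python sets of index pairs, the function names are hypothetical, and numerical cancellation is ignored (so the result is the structural pattern).

```python
def pattern_of(A, tol=0.0):
    """Pattern of A as a set of (i, j); a positive tol sparsifies A first."""
    return {(i, j) for i, row in enumerate(A)
            for j, v in enumerate(row) if abs(v) > tol}


def pattern_product(P, Q):
    """Structural product: (i, j) is present if (i, k) in P and (k, j) in Q for some k."""
    cols_of_row = {}
    for k, j in Q:
        cols_of_row.setdefault(k, set()).add(j)
    return {(i, j) for i, k in P for j in cols_of_row.get(k, ())}


def pattern_power(P, m):
    """Structural pattern of the m-th power of a matrix with pattern P."""
    R = set(P)
    for _ in range(m - 1):
        R = pattern_product(R, P)
    return R


n = 8
A = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
S1 = pattern_of(A)
S2 = pattern_power(S1, 2)                      # pattern of A^2
# Candidate triangular patterns from powers of the triangular parts of A.
lower_pattern = pattern_power({(i, j) for i, j in S1 if i >= j}, 2)
upper_pattern = pattern_power({(i, j) for i, j in S1 if i <= j}, 2)
# Remedy for fill: take the power of a sparsified A (here everything but
# the diagonal falls below the threshold, an extreme case).
S2_sparsified = pattern_power(pattern_of(A, tol=1.5), 2)
print(len(S1), len(S2), len(S2_sparsified))
```

For the tridiagonal example the squared pattern is pentadiagonal, which shows how quickly the number of candidate positions grows with the power.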


11.3 Factorized Approximate Inverses Based on Incomplete Conjugation


An alternative way to obtain a factorized approximate inverse is based on incomplete
conjugation (A-orthogonalization) in the SPD case and on incomplete
A-biconjugation in the general case. For SPD matrices, the approach represents an
approximate Gram-Schmidt orthogonalization that uses the A-inner product $\langle \cdot, \cdot \rangle_A$.
An important attraction is that the sparsity patterns of the approximate inverse
factors need not be specified in advance; instead, they are determined dynamically
as the preconditioner is computed.


11.3.1 AINV Preconditioner: SPD Case 


When A is an SPD matrix, the AINV preconditioner is defined by an approximate
inverse factorization of the form

\[ A^{-1} \approx M^{-1} = Z D^{-1} Z^T, \]

where the matrix Z is unit upper triangular and D is a diagonal matrix with positive
entries. The factor Z is a sparse approximation of the inverse of the $L^T$ factor in the
square root-free factorization of A. Z and D are computed directly from A using
an incomplete A-orthogonalization process applied to the columns of the identity
matrix. If entries are not dropped, then a complete factorization of $A^{-1}$ is computed
and Z is significantly denser than $L^T$. To preserve sparsity, at each step of the
computation, entries are discarded (for example, using a prescribed threshold, or
according to the positions of the entries, or by retaining a chosen number of the
largest entries in each column), resulting in an approximate factorization of $A^{-1}$.
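To see why Z approximates the inverse of the $L^T$ factor, it may help to write out the case in which nothing is dropped. The short derivation below uses only the square root-free factorization $A = LDL^T$ and the definitions given above.

```latex
A = L D L^T
\;\Longrightarrow\;
A^{-1} = L^{-T} D^{-1} L^{-1} = Z D^{-1} Z^T
\quad\text{with}\quad Z = L^{-T},
\qquad\text{and}\qquad
Z^T A Z = L^{-1} \left( L D L^T \right) L^{-T} = D .
```

The last identity states that the columns of Z are mutually A-orthogonal, which is the property that the A-orthogonalization process enforces column by column.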

There are several variants. Algorithms 11.3 and 11.4 outline left-looking and
right-looking approaches, respectively. Practical implementations need to employ
sparse matrix techniques. The left-looking scheme computes the j-th column $z_j$ of
Z as a sparse linear combination of the previous columns $z_1, \ldots, z_{j-1}$. The key
is determining which multipliers (the $\alpha$'s in Steps 4 and 5 of the two algorithms,
respectively) are nonzero and need to be computed. This can be achieved very
efficiently by having access to both the rows and columns of A (although the
algorithm does not require that A is explicitly stored; only the capability of forming
inner products involving the rows of A is required). For the right-looking approach,
the crucial part for each j is the update of the sparse submatrix of Z composed of
the columns j + 1 to n that are not yet fully computed. Here, only one row of A is
used in the outer loop of the algorithm. Therefore, A can be generated on-the-fly by
rows. The DS format can be used to store the partially computed Z (Section 1.3.2).
As with complete factorizations, the efficiency of the computation and application
of AINV preconditioners can benefit from incorporating blocking.


ALGORITHM 11.3 AINV preconditioner (left-looking approach)

Input: SPD matrix A and sparsifying rule.

Output: $A^{-1} \approx Z D^{-1} Z^T$ with Z a unit upper triangular matrix and D a diagonal
matrix with positive diagonal entries.

 1: $[z_1^{(0)}, \ldots, z_n^{(0)}] = [e_1, \ldots, e_n]$    ▷ Initialise Z to hold the columns of the identity matrix
 2: for j = 1 : n do
 3:     for k = 1 : j - 1 do
 4:         $\alpha = A_{k,1:n}\, z_j^{(k-1)} / d_{kk}$
 5:         $z_j^{(k)} = z_j^{(k-1)} - \alpha\, z_k^{(k-1)}$
 6:         Sparsify $z_j^{(k)}$    ▷ Drop entries from $z_j^{(k)}$
 7:     end for
 8:     $d_{jj} = A_{j,1:n}\, z_j^{(j-1)}$
 9: end for
10: $Z = [z_1^{(0)}, \ldots, z_n^{(n-1)}]$
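The left-looking scheme of Algorithm 11.3 can be sketched in a few lines for a small dense matrix. This is an illustrative sketch only: a practical code would use sparse data structures, the absolute threshold `drop_tol` is just one possible sparsifying rule, and the names `ainv_left` and `apply_ainv` are hypothetical.

```python
def ainv_left(A, drop_tol=0.0):
    """Left-looking AINV sketch for a dense SPD matrix A (list of rows).

    Returns (Z, d): Z[j] is column j of Z and d holds the diagonal of D.
    Entries with magnitude <= drop_tol are dropped after each update;
    the unit diagonal entry of each column is never dropped.
    """
    n = len(A)
    Z = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    d = [0.0] * n
    for j in range(n):
        zj = Z[j]
        for k in range(j):
            # Step 4: multiplier from row k of A and the current column z_j.
            alpha = sum(A[k][i] * zj[i] for i in range(n)) / d[k]
            if alpha == 0.0:
                continue                      # nothing to update for this k
            zk = Z[k]
            for i in range(k + 1):            # z_k is zero below row k
                zj[i] -= alpha * zk[i]
            for i in range(j):                # Step 6: sparsify above the diagonal
                if abs(zj[i]) <= drop_tol:
                    zj[i] = 0.0
        d[j] = sum(A[j][i] * zj[i] for i in range(n))     # Step 8
        if d[j] <= 0.0:
            raise ArithmeticError(f"breakdown: nonpositive pivot at step {j}")
    return Z, d


def apply_ainv(Z, d, r):
    """Return Z D^{-1} Z^T r using only products with the columns of Z."""
    n = len(r)
    t = [sum(Z[j][i] * r[i] for i in range(n)) / d[j] for j in range(n)]
    return [sum(Z[j][i] * t[j] for j in range(n)) for i in range(n)]


def nnz(Z):
    """Number of stored nonzero entries."""
    return sum(1 for col in Z for v in col if v != 0.0)


n = 5
A = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
Z_full, d_full = ainv_left(A)                 # no dropping: exact inverse factors
Z_drop, d_drop = ainv_left(A, drop_tol=0.1)   # incomplete factors
b = [1.0] * n
y = apply_ainv(Z_full, d_full, b)
residual = max(abs(sum(A[i][j] * y[j] for j in range(n)) - b[i]) for i in range(n))
print(nnz(Z_full), nnz(Z_drop), residual)
```

Without dropping, applying the factors solves the system up to rounding error; with dropping, Z is sparser and the result is only an approximation suitable for use as a preconditioner. Note that applying the preconditioner involves matrix-vector products only.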


11.3.2 AINV Preconditioner: General Case 


In the general case, the AINV preconditioner is given by an approximate inverse
factorization of the form

\[ A^{-1} \approx M^{-1} = W D^{-1} Z^T, \]

where Z and W are unit upper triangular matrices and D is a diagonal matrix. Z and
W are sparse approximations of the inverses of the $L^T$ and U factors in the LDU
factorization of A, respectively. Starting from the columns of the identity matrix,
A-biconjugation is used to compute the factors. Algorithm 11.5 outlines the
right-looking approach. Note it offers two possibilities for computing the entries $d_{jj}$ of
D that are equivalent in exact arithmetic if the factorization is breakdown-free. The
left-looking variant given in Algorithm 11.3 can be generalized in a similar way.
Figure 11.2 illustrates the sparsity patterns of the AINV factors for a matrix
arising in circuit simulation. S{A} is symmetric, but the values of the entries of
A are nonsymmetric. The sparsity pattern $S\{W + Z^T\}$ is given, where W and
Z are computed using Algorithm 11.5 with sparsification based on a dropping
tolerance of 0.5. Also given are the patterns $S\{\tilde{L} + \tilde{U}\}$ and $S\{\tilde{L}^{-1} + \tilde{U}^{-1}\}$ for
the incomplete factors $\tilde{L}$ and $\tilde{U}$ computed using Algorithm 10.2 (see Section 10.2)
with a dropping tolerance of 0.1 and at most 10 entries in each row of $\tilde{L} + \tilde{U}$.
Note that this dual dropping strategy is one of the most popular ways of employing

Figure 11.2 An example to illustrate the difference between the sparsity patterns of the AINV
factors and those of the inverse of the ILU factors. The sparsity pattern S{A} of the matrix A is
given (top left) together with the patterns of the factorized approximate inverse factors $S\{W + Z^T\}$
(top right), the ILU factors $S\{\tilde{L} + \tilde{U}\}$ (bottom left), and their inverses $S\{\tilde{L}^{-1} + \tilde{U}^{-1}\}$ (bottom
right).


Algorithm 10.2; it is often denoted as ILUT(p, $\tau$), where p is the maximum number
of entries allowed in each row and $\tau$ is the dropping tolerance. In this example, the
parameters were chosen so that the number of entries in both $W + Z^T$ and $\tilde{L} + \tilde{U}$
is approximately equal, but the resulting sparsity patterns are clearly different. In
particular, potentially important information is lost from $S\{\tilde{L}^{-1} + \tilde{U}^{-1}\}$.
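The biconjugation process of Algorithm 11.5 can be illustrated on a small dense matrix. The sketch below is an assumption-laden illustration rather than a practical implementation: it uses dense storage, an absolute dropping threshold `drop_tol`, the column-based formula for $d_{jj}$, and a hypothetical 4 x 4 test matrix with a symmetric pattern and nonsymmetric values.

```python
def ainv_nonsym(A, drop_tol=0.0):
    """Right-looking nonsymmetric AINV sketch; Z[k] and W[k] are columns k."""
    n = len(A)
    Z = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    W = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    d = [0.0] * n
    for j in range(n):
        # d_jj from column j of A and z_j (the row-based alternative is
        # the product of row j of A with w_j).
        d[j] = sum(A[i][j] * Z[j][i] for i in range(n))
        if d[j] == 0.0:
            raise ArithmeticError(f"breakdown: zero pivot at step {j}")
        for k in range(j + 1, n):
            alpha = sum(A[i][j] * Z[k][i] for i in range(n)) / d[j]  # column j of A with z_k
            beta = sum(A[j][i] * W[k][i] for i in range(n)) / d[j]   # row j of A with w_k
            for i in range(j + 1):             # z_j and w_j are zero below row j
                Z[k][i] -= alpha * Z[j][i]
                W[k][i] -= beta * W[j][i]
            for i in range(k):                 # sparsify, keeping the unit diagonal
                if abs(Z[k][i]) <= drop_tol:
                    Z[k][i] = 0.0
                if abs(W[k][i]) <= drop_tol:
                    W[k][i] = 0.0
    return Z, W, d


def biconjugacy(A, Z, W):
    """Dense matrix Z^T A W; it equals D when no entries are dropped."""
    n = len(A)
    AW = [[sum(A[i][l] * W[q][l] for l in range(n)) for i in range(n)]
          for q in range(n)]                   # AW[q] holds the vector A w_q
    return [[sum(Z[p][i] * AW[q][i] for i in range(n)) for q in range(n)]
            for p in range(n)]


A = [[4.0, -1.0, 0.0, 0.0],
     [-2.0, 5.0, -1.0, 0.0],
     [0.0, -3.0, 6.0, -1.0],
     [0.0, 0.0, -2.0, 7.0]]
Z, W, d = ainv_nonsym(A)
B = biconjugacy(A, Z, W)
d_rows = [sum(A[j][i] * W[j][i] for i in range(len(A))) for j in range(len(A))]
print(d, d_rows)
```

In this dropping-free run the two formulas for $d_{jj}$ agree and $Z^T A W$ is diagonal, so $W D^{-1} Z^T$ reproduces $A^{-1}$ up to rounding error.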


11.3.3 SAINV: Stabilization of the AINV Method 


The following result is analogous to Theorem 9.4. 


Theorem 11.5 (Benzi et al. 1996) If A is a nonsingular M- or H-matrix, then the 
AINV factorization of A does not break down. 


For more general matrices, breakdown can happen because of the occurrence of
a zero $d_{jj}$ or, in the SPD case, a negative $d_{jj}$. In practice, exact zeros are unlikely
but very small $d_{jj}$ can occur (near breakdown), which may lead to uncontrolled
growth in the size of entries in the incomplete factors and, because such entries
are not dropped when using a threshold parameter, a large amount of fill-in. The
next theorem indicates how breakdown can be prevented when A is SPD through
reformulating the A-orthogonalization.


ALGORITHM 11.4 AINV preconditioner (right-looking approach)

Input: SPD matrix A and sparsifying rule.

Output: $A^{-1} \approx Z D^{-1} Z^T$ with Z a unit upper triangular matrix and D a diagonal
matrix with positive diagonal entries.

 1: $[z_1^{(0)}, \ldots, z_n^{(0)}] = [e_1, \ldots, e_n]$    ▷ Initialise Z to hold the columns of the identity matrix
 2: for j = 1 : n do
 3:     $d_{jj} = A_{j,1:n}\, z_j^{(j-1)}$
 4:     for k = j + 1 : n do
 5:         $\alpha = A_{j,1:n}\, z_k^{(j-1)} / d_{jj}$
 6:         $z_k^{(j)} = z_k^{(j-1)} - \alpha\, z_j^{(j-1)}$
 7:         Sparsify $z_k^{(j)}$    ▷ Drop entries from $z_k^{(j)}$
 8:     end for
 9: end for
10: $Z = [z_1^{(0)}, z_2^{(1)}, \ldots, z_n^{(n-1)}]$


Theorem 11.6 (Benzi et al. 2000; Kopal et al. 2012) Consider Algorithm 11.4
with no sparsification (Step 7 is removed). The following identity holds:

\[ A_{j,1:n}\, z_k^{(j-1)} = e_j^T A\, z_k^{(j-1)} = \langle z_j^{(j-1)}, z_k^{(j-1)} \rangle_A, \qquad 1 \le j \le k \le n. \]

Proof Because $AZ = Z^{-T} D$ and $Z^{-T} D$ is lower triangular, entries 1 to j - 1 of
the vector $A z_k^{(j-1)}$ are equal to zero. Z is unit upper triangular so entries j + 1 to n
of its j-th column $z_j^{(j-1)}$ are also equal to zero. Thus, $z_j^{(j-1)}$ can be written as the
sum $z + e_j$, where entries j to n of the vector z are zero. The result follows. □
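The final step of the proof can be spelled out as follows. This is a short expansion in the notation of the proof, with z denoting the vector whose entries j to n are zero.

```latex
\langle z_j^{(j-1)}, z_k^{(j-1)} \rangle_A
  = (z + e_j)^T A\, z_k^{(j-1)}
  = \underbrace{z^T A\, z_k^{(j-1)}}_{=\,0}
    + e_j^T A\, z_k^{(j-1)}
  = A_{j,1:n}\, z_k^{(j-1)} .
```

The first term vanishes because z can be nonzero only in positions 1 to j - 1, which are exactly the positions in which $A z_k^{(j-1)}$ is zero.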


This suggests using alternative computations within the AINV approach based
on the whole of A instead of on its rows. The reformulation, which is called
the stabilized AINV algorithm (SAINV), is outlined in Algorithm 11.6. It is
breakdown-free for any SPD matrix A because the diagonal entries are
$d_{jj} = \langle z_j^{(j-1)}, z_j^{(j-1)} \rangle_A > 0$. Practical experience shows that, while slightly more costly to
compute, the SAINV algorithm gives higher quality preconditioners than the AINV
algorithm. However, the computed diagonal entries can still be very small and may
need to be modified.

The factors Z and D obtained with no sparsification can be used to compute 
the square root-free Cholesky factorization of A. The L factor of A and the inverse 
factor Z computed using Algorithm 11.6 without sparsification satisfy 


ALGORITHM 11.5 Nonsymmetric AINV preconditioner (right-looking approach)

Input: Nonsymmetric matrix A and sparsifying rule.

Output: $A^{-1} \approx W D^{-1} Z^T$ with W and Z unit upper triangular matrices and D a
diagonal matrix.

 1: $[z_1^{(0)}, \ldots, z_n^{(0)}] = [e_1, \ldots, e_n]$ and $[w_1^{(0)}, \ldots, w_n^{(0)}] = [e_1, \ldots, e_n]$
 2: for j = 1 : n do
 3:     $d_{jj} = (A_{1:n,j})^T z_j^{(j-1)}$ or $d_{jj} = A_{j,1:n}\, w_j^{(j-1)}$
 4:     for k = j + 1 : n do
 5:         $\alpha = (A_{1:n,j})^T z_k^{(j-1)} / d_{jj}$
 6:         $z_k^{(j)} = z_k^{(j-1)} - \alpha\, z_j^{(j-1)}$
 7:         Sparsify $z_k^{(j)}$    ▷ Drop entries from $z_k^{(j)}$
 8:         $\beta = A_{j,1:n}\, w_k^{(j-1)} / d_{jj}$
 9:         $w_k^{(j)} = w_k^{(j-1)} - \beta\, w_j^{(j-1)}$
10:         Sparsify $w_k^{(j)}$    ▷ Drop entries from $w_k^{(j)}$
11:     end for
12: end for
13: $Z = [z_1^{(0)}, \ldots, z_n^{(n-1)}]$ and $W = [w_1^{(0)}, \ldots, w_n^{(n-1)}]$



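\[ AZ = LD \quad \text{or} \quad L = AZD^{-1}. \]

Using $d_{jj} = \langle z_j^{(j-1)}, z_j^{(j-1)} \rangle_A$, and equating corresponding entries of $AZD^{-1}$ and
L, gives

\[ l_{ij} = \frac{\langle z_j^{(j-1)}, z_i^{(j-1)} \rangle_A}{\langle z_j^{(j-1)}, z_j^{(j-1)} \rangle_A}, \qquad 1 \le j \le i \le n. \]

Thus, the SAINV algorithm generates the L factor of the square root-free Cholesky
factorization of A as a by-product of orthogonalization in the inner product $\langle \cdot, \cdot \rangle_A$
at no extra cost and without breakdown.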
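The following sketch illustrates both points on a small dense SPD matrix: the pivots of the stabilized process are A-norms and hence positive, and without sparsification the multipliers are the entries of the L factor. It is a minimal illustration under stated assumptions (dense storage, an absolute threshold `drop_tol`, a hypothetical test matrix, and the hypothetical function name `sainv`), not a practical implementation.

```python
def sainv(A, drop_tol=0.0):
    """Right-looking SAINV sketch for a dense SPD matrix A (list of rows).

    Returns (Z, d, L): Z[k] is column k of Z, d is the diagonal of D and
    L holds the multipliers as a unit lower triangular matrix. L equals the
    square root-free Cholesky factor only when drop_tol is zero.
    """
    n = len(A)
    Z = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # A z_j uses the whole of A and is reused for every inner product below.
        Azj = [sum(A[i][l] * Z[j][l] for l in range(n)) for i in range(n)]
        d[j] = sum(Z[j][i] * Azj[i] for i in range(n))       # <z_j, z_j>_A
        for k in range(j + 1, n):
            alpha = sum(Z[k][i] * Azj[i] for i in range(n)) / d[j]   # <z_k, z_j>_A / d_jj
            L[k][j] = alpha
            for i in range(j + 1):             # z_j is zero below row j
                Z[k][i] -= alpha * Z[j][i]
            for i in range(k):                 # sparsify, keeping the unit diagonal
                if abs(Z[k][i]) <= drop_tol:
                    Z[k][i] = 0.0
    return Z, d, L


A = [[4.0, 1.0, 0.0, 1.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 5.0, 2.0],
     [1.0, 0.0, 2.0, 6.0]]       # symmetric and strictly diagonally dominant, hence SPD
n = len(A)
Z, d, L = sainv(A)
# Reassemble L D L^T and compare it with A.
LDLt = [[sum(L[i][m] * d[m] * L[j][m] for m in range(n)) for j in range(n)]
        for i in range(n)]
err = max(abs(LDLt[i][j] - A[i][j]) for i in range(n) for j in range(n))
Z_drop, d_drop, _ = sainv(A, drop_tol=0.3)     # aggressive dropping
print(err, d_drop)
```

Even with aggressive dropping every pivot stays positive, because each column of Z keeps its unit diagonal entry and is therefore nonzero.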

The stabilization strategy can be extended to the nonsymmetric AINV algorithm 
using the following result. 


Theorem 11.7 (Benzi & Tůma 1998; Bollhöfer & Saad 2002) Consider Algorithm 11.5
with no sparsification (Steps 7 and 10 removed). The following identities
hold:


ALGORITHM 11.6 SAINV preconditioner (right-looking approach)

Input: SPD matrix A and sparsifying rule.

Output: $A^{-1} \approx Z D^{-1} Z^T$ with Z a unit upper triangular matrix and D a diagonal
matrix with positive diagonal entries.

 1: $[z_1^{(0)}, \ldots, z_n^{(0)}] = [e_1, \ldots, e_n]$
 2: for j = 1 : n do
 3:     $d_{jj} = \langle z_j^{(j-1)}, z_j^{(j-1)} \rangle_A$
 4:     for k = j + 1 : n do
 5:         $\alpha = \langle z_k^{(j-1)}, z_j^{(j-1)} \rangle_A / d_{jj}$
 6:         $z_k^{(j)} = z_k^{(j-1)} - \alpha\, z_j^{(j-1)}$
 7:         Sparsify $z_k^{(j)}$    ▷ Drop entries from $z_k^{(j)}$
 8:     end for
 9: end for
10: $Z = [z_1^{(0)}, z_2^{(1)}, \ldots, z_n^{(n-1)}]$


\[ A_{j,1:n}\, w_k^{(j-1)} = e_j^T A\, w_k^{(j-1)} = (z_j^{(j-1)})^T A\, w_k^{(j-1)}, \]
\[ (A_{1:n,j})^T z_k^{(j-1)} = e_j^T A^T z_k^{(j-1)} = (w_j^{(j-1)})^T A^T z_k^{(j-1)}, \qquad 1 \le j \le k \le n. \]


The nonsymmetric SAINV algorithm obtained using this reformulation can improve 
the preconditioner quality, but it is not guaranteed to be breakdown-free. 


11.4 Notes and References 


Benzi & Tůma (1999) present an early comparative study that puts preconditioning
by approximate inverses into the context of alternative preconditioning techniques;
see also Bollhöfer & Saad (2002, 2006), Benzi & Tůma (2003), and Bru et al. (2008,
2010). The inverse by bordering method mentioned in Section 11.1 is from Saad
(2003b).

The first use of approximate inverses based on Frobenius norm minimization is 
given by Benson (1973). A SPAI approach that can exploit a dynamically changing 
sparsity pattern S is introduced in Cosgrove et al. (1992); an independent and 
enhanced description is given in the influential paper by Grote & Huckle (1997). 
Later developments are presented in Holland et al. (2005), Jia & Zhang (2013), 
and Jia & Kang (2019). A comprehensive discussion on the choice of the sparsity 
pattern S can be found in Huckle (1999). Huckle & Kallischko (2007) consider 
modifying the SPAI method by probing or symmetrizing the approximate inverse,
and Bröker et al. (2001) look at using approximate inverses based on Frobenius
norm minimization as smoothers for multigrid methods. Choosing sparsity patterns 
for a related approximate inverse with a particular emphasis on parallel computing 
is described by Chow (2000). 

For nonsymmetric matrices, MI12 within the HSL mathematical software 
library computes SPAI preconditioners (see Gould & Scott, 1998 for details and 
a discussion of the merits and limitations of the approach). An early parallel 
implementation is given by Barnard et al. (1999). Dehnavi et al. (2013) present 
an efficient parallel implementation that uses GPUs and include comparisons with 
ParaSails (Chow, 2001). The latter handles SPD problems using a factored sparse 
approximate inverse and general problems with an unfactored sparse approximate 
inverse. A priori techniques determine S as a power of a sparsified matrix. 

Original work on the FSAI preconditioner is by Kolotilina & Yeremin (1986, 
1993). Its use in solving systems on massively parallel computers is presented in 
Kolotilina et al. (1992), while an interesting iterative construction can be found in 
Kolotilina et al. (2000). A parallel variant called ISAI preconditioning that combines 
a Frobenius norm-based approach with traditional ILU preconditioning is proposed 
by Anzt et al. (2018). FSAI preconditioning has attracted significant theoretical 
and practical attention. Recent contributions discuss not only its efficacy but also 
parallel computation, the use of blocks, supernodes, and multilevel implementations 
(Ferronato et al., 2012, 2014; Janna & Ferronato, 2011; Janna et al., 2010, 2013, 
2015; Ferronato & Pini, 2018; Magri et al., 2018). Many of these enhancements are 
exploited in the FSAIPACK software of Janna et al. (2015). 

The AINV preconditioner for SPD and nonsymmetric systems is introduced 
in Benzi et al. (1996) and Benzi & Tůma (1998), respectively; see also Benzi 
et al. (1999) for a parallel implementation. However, the development of this type 
of preconditioner follows much earlier interest in factorized matrix inverses (for 
example, Morris, 1946 and Fox et al., 1948). For the SAINV algorithm, see Benzi 
et al. (2000) and Kharchenko et al. (2001). Theoretical and practical properties of 
the AINV and SAINV factorizations are studied in a series of papers by Kopal et al. 
(2012, 2016, 2020). 